Lenore R. Mullin and James E. Raynolds 



Conformal Computing: 
Algebraically connecting the 
hardware/software boundary 
using a uniform approach to 
high-performance computation 
for software and hardware 
applications 

SPIN Springer's internal project number, if known 

- Monograph - 

March 17, 2008 



Springer 



Contents 



Introduction 

1.1 General Overview 

1.2 Summary of Results: Comparing with Established Library 
Routines 

1.3 Performance of the Radix-2 FFT 

1.3.1 Experiments 

1.3.2 Evaluation of Results 

1.3.3 Origin 2000 Results 

1.3.4 SP2 Results 

1.3.5 SUN SPARCserver lOOOE results 

1.4 Performance of the General- Radix FFT 

1.4.1 General Radix Issues 

1.4.2 Experimental Environment 

1.4.3 General Radix Experiments 

1.4.4 Evaluation of Results 

1.4.5 SPARCserver lOOOE's Results (One 60 MHz Processor) 

1.4.6 Conclusions for the General Radix Approach 

1.5 Effects Due to Specilized Hardware 

1.6 Organization of the Monograph 

Conformal Computing Techniques: A Mathematics of 
Arrays (MoA) and the i/j-calculus 

2.1 Elements of the Theory 

2.1.1 Indexing and Shapes 

2.1.2 Example 

2.1.3 Higher Order Operations 

2.1.4 Definition of 

2.2 Contrasting MoA with Linear Algebra 

2.2.1 Scalars as Arrays 

2.2.2 Need for Empty Arrays 

2.2.3 Graphical Representation 



4 
5 
5 
6 
6 
7 
9 
S 
C 

11 

n 

12 
12 
13 
15 
17 



VI 



Contents 



2.2.4 Notational Subtleties \2£ 

2.2.5 Addition and Multiplication of Arrays: Comparing 

and Contrasting with Linear Algebra [SC 



3 A Cache-Optimized Fast Fourier Transform: Part I 

3.1 Chapter Summary 

3.2 Introduction 

3.3 Index Optimizations for the Traditional FFT 

3.3.1 Traditional FFT Algorithm 

3.3.2 Index Optimization Using the ^-Calculus 

3.3.3 Optimizing Array Access Patterns 

3.4 Cache-Optimized FFT: Key Elements of the New Approach. 
3.4.1 Restructuring the Data: the Reshape- Transpose 



Operation |37 

3.4.2 Re-Ordering of "Butterfly" Operations in the 
Cache-Optimized FFT 

3.4.3 The Number of Reshape- Transpose Operations 

3.4.4 Re-Ordering of the Weights 

3.4.5 V-Reduction 

3.4.6 Structure of the Weight Matrices 

3.4.7 Transforming the Weights 

3.5 Cache-Optimized FFT Illustrated 

3.5.1 Speciflc Examples and Patterns 

3.5.2 Structure of Reshape- Transposed Weight Matrix 

3.5.3 Derivation of N^^'' From the Weight Vector 

3.5.4 Last Step 

3.6 The General Algorithm 

3.6.1 Numbering 

3.6.2 General Weight Sub-Matrices 

3.6.3 Number of Different Weight Matrices 

3.6.4 Number of Copies of a Given Cycle Length 

3.7 Cache-Optimized FFT Using the ^-Calculus 

3.7.1 Reshape Transpose Operation 

3.7.2 ^/i-Reduction of the Reshape- Transpose Operation .... 

3.7.3 Reordering the Data 

3.8 Results and Discussion 

3.9 Conclusions 



A Cache-Optimized Fast Fourier Transform: Part II 

4.1 Chapter Summary 

4.2 Introduction 

4.3 Reshape Transpose 

4.3.1 Algebraic Specification 

4.3.2 ^/j-Reduction: Denotational Normal Form and 
Operational Normal Form [Hc 



Contents VII 



4.3.3 From the ONF to the Generic Design 

4.3.4 ■(/'-Reduction of the Reshape- Transpose Operation . . . . 

4.4 Reordering the Data 

4.4.1 Final Transpose 

4.4.2 General Rule 

4.4.3 ^/i-Reduction 

4.4.4 The Correspondence Theorem (PCT) 

4.4.5 Applying the '0 Correspondence Theorem 

4.4.6 Summary of Final- Transpose Operation 

4.5 Conclusions 

A Cache-Optimized Fast Fourier Transform: Part III 

5.1 Chapter Summary 

5.2 Introduction 

5.3 Extreme Generality of Representation: Contrasting MoA 
with Linear Algebra 

5.3.1 Linear Arrays and Hyper-Cubes 

5.3.2 The Importance of the Array's Shape 

5.3.3 MoA Operator Constructs and the T/j-Calculus 

5.4 FFT in the Hyper-Cube 

5.5 Matrices, Arrays, Hyper-Cubes 

5.6 Derivation 

5.6.1 Reduction of D 

5.6.2 Denotational Normal Form for D 

5.6.3 Reduction of B 

5.6.4 Denotational Normal Form for B 

5.6.5 Reduction of A 

5.6.6 Denotational Normal Form for A 

5.6.7 Final Reduction 

5.6.8 Final Denotational Normal Form 

5.7 Thinking in Hyper-Cubes 

5.7.1 Transpose Formulation of Butterfly 

5.8 Merging the two Derivations 

5.9 Implementation and Performance 

5.10 Conclusion 

Density Matrix Operations for a Quantum Computer 

6.1 Chapter Summary 

6.2 Quantum Computing: Motivation for a Matrix Problem with 
Arbitrary Array Access Patterns 

6.3 Quantum Evolution: the Density Matrix 

6.4 The Algorithm 

6.4.1 Assumptions in the Example: 

6.4.2 Final Expression and Normal Form 

6.5 Conclusion 



IOC 



104 



107 



107 



107 
108 

lie 



111 



113 



113 



VIII Contents 

6.6 Acknowledgments 

7 Conclusions and Grand Challenges .... 

7.1 Conclusions 

7.2 Grand Challenges and Future Directions 

References 



1 



Introduction 



1.1 General Overview 

We present a systematic, algebraically based, design methodology for effi- 
cient implementation of computer programs optimized over multiple levels of 
the processor/memory and network hierarchy. Using a common formalism to 
describe the problem and the partitioning of data over processors and mem- 
ory levels allows one to mathematically prove the efficiency and correctness 
of a given algorithm as measured in terms of a set of metrics (such as pro- 
cessor/network speeds, etc.). The approach allows the average programmer 
to achieve high-level optimizations similar to those used by compiler writers 
(e.g. the notion of tiling). 

The approach is similar in spirit to other efforts using libraries of algorithm 
building blocks based on C-I--I- template classes. In POOMA for example, 
expression templates using the Portable Expression Template Engine (PETE) 
(http://www.acl.lanl.gov.pete) were used to achieve efficient distribution of 
array indexing over scalar operations [ll[2l[3l|4l[5l[6l[71[8l|9l [10] . 

As another example. The Matrix Template Library (MTL) [HIITJ] is a sys- 
tem that handles dense and sparse matrices, and uses template meta-programs 
to generate tiled algorithms for dense matrices. 

For example, the addition of two 2-dimensional arrays A and B 

{A + B),j^A,j+B,j, (1.1) 

can be generalized to the situation in which multi-dimensional arrays are 
selected using a vector of indices v. 

In POOMA A and B were represented as classes, denoting arrays of any 
dimension, and expression templates were used as re-write rules to efficiently 
carry out the translation to scalar operations implied by: 



{A + B)v = A^ + B, 



(1.2) 
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The approach presented in this monograph makes use of A Mathematics of 
Arrays (MoA) [T3j and an indexing calculus (i.e. the tp calculus) to enable the 
programmer to develop algorithms using high-level compiler-like optimizations 
through the ability to algebraically compose and reduce sequences of array 
operations. 

As such, the translation from the left hand side of Eq. 11.21 to the right 
side is just one of a wide variety of operations that can be carried out using 
this algebra. In the MoA formalism, the array expression in Eq. 11.21 would be 
written: 

vip{A + B) =vtl;A + vil;B (1.3) 

where we have introduced the psi-operator ip to denote the operation of ex- 
tracting an element from the multi-dimensional array using the index vector 
(v). 

In this work we demonstrate, in detail, our approach as applied to the 
creation of efficient implementations of the Fast Fourier Transform (FFT) 
optimized over multi-processor, multi- memory/network, environments. Multi- 
dimensional data arrays are reshaped through the process of dimension lifting 
to explicitly add dimensions to enable indexing blocks of data over the vari- 
ous levels of the hierarchy. A sequence of array operations (represented by the 
various operators of the algebra acting on arrays) is algebraically composed 
to achieve the Denotational Normal Form (DNF). The DNF is a semantic 
normal form expressed in terms of cartesian coordinates. Then the DNF is 
transformed into the ONF [Operational Normal Form) which explicitly ex- 
presses the algorithm in terms of loops and operations on indices (i.e. starts, 
stops, and strides are explicitly indicated) . This reduction is based on an index 
calculus we call the ■0-calculus. The ONF can thus be directly translated into 
efficient code in the language of the programmer's choice be it for hardware 
or software application. 

The algebra we use is a Universal Algebra based on Sylvester's work [T3] 
which was later embodied in the programming language APL. The algebra of 
MoA is a functional, enhanced subset of APL. We emphasize, however, that 
MoA is not a programming language but rather is a mathematical theory. 
Also it is important to note that APL does not have an index calculus 
similar to the ■0-calculus. The notion of an index calculus was suggested by 
Abrams in 1970 [U] and it was extended and closed by MuUin [T5] . 

The application we use as a first demonstration vehicle - the Fast Fourier 
Transform - is of significant interest in its own right. The Fast Fourier Trans- 
form (FFT) is one of the most important computational algorithms and its use 
is pervasive in science and engineering. This work traces the development and 
refinement of efficient implementations of the FFT including one in which the 
FFT was optimized in terms of in-cache operations leading to factors of two 
to four speedup in comparison with our previous records. [16] Further back- 
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ground material including comparisons with library routines can be found in 
Refs. [T71 [THl HH HH] and [IS], and will be reviewed in a following section. 

Our approach can be seen to be a generalization of similar work aimed at 
out-of-core optimizations |21| . Similarly, tensor decompositions of matrices (in 
general) are special cases of our reshape-transpose design. Most importantly, 
our designs are general for any partition size, i.e. not necessary blocked in 
squares, and any number of dimensions. Furthermore, our designs use linear 
transformations from an algebraic specification and thus they are verified. 
Thus, by specifying designs (such as Gormen's and others [3T]) using these 
techniques, these designs too could be verified. 

In the context of the cache-optimized algorithm we DO NOT attempt 
any serious analysis of the number of cache misses incurred by the algorithm 
in the spirit of of Hong and Kung and others [22, 23l[24]. Rather, we present an 
algebraic method that achieves (or is competitive) with such optimizations 
mechanically. Through linear transformations we produce a normal form, 
the ONF, that is directly implementable in any hardware or software language 
and is realized in any of the processor/memory levels [25]. Most importantly, 
our designs are completely general in that through dimension lifting we can 
produce any number of levels in the processor/memory hierarchy. 

One objection to our approach is that one might incur an unacceptable 
performance cost due to the periodic rearrangement of the data that is of- 
ten needed. This will not, however, be the case if we pre- fetch data before it 
is needed. The necessity to pre-fetch data also exists in other similar cache- 
optimized schemes. Our algorithm does what the compiler community calls 
tiling. Since we have analyzed the loop structures, access patterns, and speeds 
of the processor/memory levels, pre-fetching becomes a deterministic cost 
function that can easily be combined with reshape-transpose or tiling opera- 
tions. 

Again we make no attempt to optimize the algorithm for any particular 
architecture. We provide a general algorithm in the form of an Operational 
Normal Form that allows the user to specify the blocking size at run time. 
This ONF therefore enables the individual user to choose the blocking size 
that gives the best performance for any individual machine, assuming this 
intentional information can be processed by a compilei0. 

It is also important to note the importance of running reproducible 
and deterministic experiments. Such experiments are only possible when 
dedicated resources exist AND no interrupts or randomness effects mem- 
ory/cache/communications behavior. This means that multiprocessing and 
time sharing must be turned off for both OS's and Networks. 

Conformal Computing is the name given by the authors to this algebraic 
approach to the construction of computer programs for array-based compu- 



^ Processing intentional information will be the topic of a future paper 
^ The name Conformal Computing © is protected. Copyright 2003, The Research 
Foundation of State University of New York, University at Albany. 



4 1 Introduction 



tations in science and engineering. The reader should not be misled by the 
name. Conformal in this sense is not related to Conformal Mapping or similar 
constructs from mathematics although it was inspired by these concepts in 
which certain properties are preserved under transformations. In particular, 
by Conformal Computing we mean a mathematical system that conforms as 
closely as possible to the underlying structure of the hardware. Details of 
the theory including discussion of MoA and V'-calculus are provided in the 
following chapter. 

The bulk of the material in this monograph is devoted to the FFT. In the 
fourth chapter a generalized approach is presented based on the notion of a 
hyper-cube data structure. In the fifth chapter we extend this approach to 
consider and streamline certain key computational steps in a density-matrix 
based algorithm for simulation of quantum computers. 



1.2 Summary of Results: Comparing with Established 
Library Routines 

In the following sections we review previous success of the approach in 
which our routines were found to be competitive with, or outperformed well- 
established library routines. Further background can be found in Refs. |17|, 
[HIIISIEO] and [ig. 

We focus on the performance of the code fragment listed in Fig ll.ll 



do q = l,t 

L = 2**q 

do row = 0,L/2-l 

weight (row) = EXP( (2*pi*i*row) /L) 
end do 

do col' = 0,n-l,L 
do row = 0,L/2-l 

c = weight (row) *x(col +row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 



Fig. 1.1. Radix 2 FFT with in-place butterfly computation from Ref. [19]. We refer 
to this code fragment as fftpsirad2 in comparisons with other routines. 
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This piece of code is the radix-2 version of a general-radix algorithm de- 
veloped and tested in Ref. 19J. In the following comparisons, we refer to this 
routine as jjtpsirad2. 

1.3 Performance of the Radix-2 FFT 

Our radix 2 experiments were run on three dedicated systems: 

1. a SGI/Cray Origin2000 at NCS^l in Illinois, with 48, 195Mhz RIOOOO 
processors, and 14GB of memory. The LI cache size is 64KB (32KB Icache 
and 32 KB Dcache). The Origin2000 has a 4MB L2 cache. The OS is IRIX 
6.5. 

2. an IBM SP2 at the MAUI High Performance Computing Centei@, with 
32, P2SC 160Mhz processors, and 1 GB of memory. The LI cache size is 
160KB (32KB Icache and 128KB Dcache), there is no L2 cache. The OS 
is AIX 4.3. 

3. a SUN SPARCserverlOOOE, with 2, 60Mhz processors, and 128MB of 
memory. Its LI cache size is 36KB (20KB Icache and 16KB Dcache) and 
it is one-way set associative. The OS is Solaris 2.7. 

We tested against the FFTW on all three machines and against the math 
libraries supported on the IBM SP2 and the Origin 2000 machines; on the 
Origin 2000: IMSL Fortran Numerical Libraries version 3.01, NAG version 
Mark 19, and SGI's SCSL library, and on the SP2: IBM's ESSL library. 

1.3.1 Experiments 

Experiments on the Origin 2000 were run using bsub, SGI's batch processing 
environment. Similarly, on the SP2 experiments were run using loadleveler. 
In both cases we used dedicated networks and processors with ALL 
multiprocessing and time sharing DISABLED. The SPARCserver lOOOE 
was a machine dedicated to running our experiments. Experiments were re- 
peated a minimum of three times and averaged for each vector size (2"^ to 
2^**). Vendor compilers were used with the -03 and -Inline flags, for im- 
proved optimizations (i.e. no special flags, noting the combinatorial explosion 
of combinations of flags one might use in an ad hoc approach). We used Perl 
scripts to automatically compile, run, time all experiments, and to plot our 
results for various problem sizes. 
We believe that: 



^ This work was partially supported by National Computational Science Alliance, 

and utilized the NCSA SCI/CRAY Origin2000 
* We would like to thank the Maui High Performance Computing Center for access 

to their IBM SP2. 
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"the only consistent and reliable measure of performance is the ex- 
ecution time of real programs, and that all proposed alternatives to 
time as the metric or to real programs as the items measured have 
eventually led to misleading claims or even mistakes in computer de- 
sign." [26] 

Therefore, we time the execution of the entire executable, which includes 
the creation of the FFTW plan. Although FFTW claims a plan can be reused, 
the plan is a data structure and not easily saved. Therefore, they have devel- 
oped a utility called wisdom, which still requires partial plan regeneration. 

"FFTW implements a method for saving plans to disk and restoring 
them. The mechanism is called wisdom. There are pitfalls to using 
wisdom, in that it can negate FFTW's ability to adapt to changing 
hardware and other conditions. Actually, the optimal plan may change 
even between runs of the same binary on identical hardware, due to 
differences in the virtual memory environment. It is therefore wise to 
recreate wisdom every time an application is recompiled." [27j 

This is further confirmed by the fact that when we ran FFTW's test pro- 
gram on our dedicated machines we found surprising differences in time for 
the same vector size. Our dedicated machines ran a default OS, e.g. no special 
quantum, queue settings, or kernel tuning. Due to this, a plan should be as 
current as possible. 

IMSL, NAG, SCSL, and ESSL required no plan generation. The entire 
execution of code created to use each library's FFT was timed. 

1.3.2 Evaluation of Results 

Our results on the Origin 2000 and the SP2, shown in Figs. 11.2] and ll.3| and 
Tables ITTT] and [L2l indicate a performance improvement over our competition 
in a majority of cases. 

1.3.3 Origin 2000 Results 

Performance results for our monolithic FFT code, JJtpsirad2, indicate a dou- 
bling of time when the vector size is doubled for all vector sizes. IMSL doubled 
its performance up to 2^^. At 2^^ there is an apparent memory change causing 
a 400% degradation in performance. For NAG this degeneration begins at 2^^. 
The SGI library, SCSL, is apparently doing very machine specific optimiza- 
tions, perhaps out of core techniques similar to Gormen j2T] as evidenced by 
nearly identical performance times for 2^^ and 2^^. 

Against the FFTW, we achieved a performance improvement for vector 
sizes greater than 2^^. We ran slightly slower for the vectors 2^ through 2^^, 
with a maximum difference in speed of 0.013 seconds. 
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Both programs were able to run for the vector size 2^* (inputs of 2^^ 
and greater, result in stack fram,e larger than system limit error on compile). 
Therefore, we achieved a large performance increase for the upper vector sizes, 
while our performance remained competitive for the lower sizes. 





Origin 


2000 




Size 


fftpsirad2 


IMSL 


NAG 


SCSL 


FFTW 


2^ 


0.190 


0.064 


0.010 


0.065 


0.013 


2" 


0.018 


0.061 


0.010 


0.047 


0.013 


2^ 


0.018 


0.062 


0.010 


0.065 


0.014 


2'' 


0.017 


0.116 


0.011 


0.073 


0.013 


2" 


0.019 


0.063 


0.010 


0.068 


0.015 


2» 


0.018 


0.062 


0.011 


0.105 


0.014 


2'^ 


0.017 


0.122 


0.011 


0.069 


0.014 


2"' 


0.021 


0.065 


0.013 


0.056 


0.015 


2^1 


0.021 


0.064 


0.016 


0.058 


0.017 


212 


0.021 


0.067 


0.023 


0.067 


0.023 


213 


0.022 


0.075 


0.036 


0.065 


0.030 


2l4 


0.024 


0.144 


0.065 


0.066 


0.051 


2l5 


0.030 


0.120 


0.135 


0.110 


0.082 


2l6 


0.040 


0.209 


0.296 


0.080 


0.189 


2^^ 


0.065 


0.335 


0.696 


0.072 


0.395 


2i8 


0.126 


0.829 


3.205 


0.075 


0.774 


2i9 


0.238 


3.007 


9.538 


0.096 


2.186 


220 


0.442 


9.673 


18.40 


0.143 


4.611 


221 


0.884 


23.36 


38.93 


0.260 


9.191 


222 


1.910 


46.70 


92.75 


0.396 


19.19 


223 


4.014 


109.4 


187.7 


0.671 


48.69 


224 


7.550 


221.1 


442.7 


1.396 


99.10 



Table 1.1. Real Execution Times (seconds) of fftpsirad2, comparative libraries and 
FFTW on NCSA's SGI/CRAY Origin 2000. 



1.3.4 SP2 Results 

Our routine fftpsirad2 outperforms ESSL for vector sizes up to 2^^, except 
for two cases. For 2^^ and 2^^ our performance is nearly identical. Our times 
continue to double from 2^^ to 2"^^. Comparatively, for 2^^ through 2^^ ESSL 
appears to be doing some machine specific optimizations, which subsequently 
double in performance through 2^^, and fail for 2^^ and higher. Consequently, 
we perform competitively for all but five vector sizes, and we are able to 
process vectors up to 6,291,456 elements larger. 
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Origin 2000 Performance improvement 




Fig. 1.2. Percent improvement of frtpsirad2 over comparative libraries and FFTW 
on NCSA's SGI/CRAY Origin 2000. 



Against the FFTW, we achieved a performance improvement for every 

vector size from 2^ through 2^^, except for the single vector size 2-'^^. Due 
to our optimizations, our code, fflpsirad2, ran successfully for the vector size 
2^^ (inputs of 2^^ and greater, result in a not enough memory available to 
run error at runtime) . The largest size vector the FFTW was able to run on 
this system was 2^^. Therefore we achieved a performance increase for every 
size except one, and we were able to run successfully for a vector 4,194,304 
elements larger. 



-2 Performance Improvement 




Log2 of the Vector Size 



Fig. 1.3. Percent improvement of frtpsirad2 over ESSL and FFTW on Maui HPCC 
IBM SP2. 
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IBM' 


3 SP2 




Size 


fftpsirad2 


ESSL 


FFTW 


2^ 


0.010 


0.013 


0.020 


2* 


0.010 


0.053 


0.020 


2^ 


0.010 


0.013 


0.020 


26 


0.010 


0.013 


0.020 


2' 


0.010 


0.013 


0.027 


2» 


0.010 


0.016 


0.023 


2^ 


0.010 


0.010 


0.020 


2io 


0.010 


0.013 


0.023 


2" 


0.013 


0.010 


0.023 


2l2 


0.020 


0.020 


0.023 


2i3 


0.033 


0.030 


0.030 


2l4 


0.040 


0.043 


0.053 


2i5 


0.070 


0.060 


0.083 


2i6 


0.140 


0.120 


0.143 


2" 


0.280 


0.160 


0.283 


2i8 


0.580 


0.310 


0.683 


2i9 


1.216 


0.610 


1.456 


220 


2.580 


1.276 


3.273 


221 


5.553 


3.663 


6.770 


222 


12.12 


Failed 


15.553 


223 


25.25 


Failed 


Failed 



Table 1.2. Real Execution Times (seconds) of fftpsirad2 and ESSL on Maui HPCC 
IBM SP2. 



1.3.5 SUN SPARCserver lOOOE results 

Our results on the SPARCserver lOOOE, shown in Fig. 11.41 achieved perfor- 
mance improvement for every vector size except three. AdditionaUy, we were 
able to run for the vector size 2^** (inputs of 2^^ and greater, result in an 
integer overflow error on compile). The FFTW failed for 2^^, and for 2^^ it 
ran for over 34 hours. 



1.4 Performance of the General-Radix FFT 

In this and following sections we review the performance of our general radix 
algorithm, developed and tested in Ref . pj] . 

We've discussed how to design and build a faster radix 2 FFT, and the 
results we've achieved. Using the same theoretical framework we can extend 
this design and subsequent implementations to handle any radix. Given a 
particular architecture and its memory hierarchy, a radix other than 2 may 
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SparcServerl 000 Performance Improvement 




2 4 6 8 10 12 14 16 18 20 22 

Log2 of fhe Vecfor Size 



Fig. 1.4. Percent improvement of fftpsirad2 over FFTW on SPARCserver lOOOE. 



be more appropriate. For example, if a machine has an n-way associative 
cache, then radix-(na) should be used (with a > 1), thus decreasing control 
time. 

By generalizing the algorithm, we can investigate how a change in radix af- 
fects performance without changing the algorithm and subsequent executable. 
In what follows, we refer to our radix-n FFT as fftpsiradn. Fig. 11.51 illustrates 
our F90 software realization for the body of the radix 2 butterfly. Observe that 
the variable c is a scalar and ebase is implicitly calculated based on radix 
2, i.e. 1 and -1. Also, observe that the butterfly is explicitly done using F90 
syntax. In the generalized code. Fig. II. 6( c becomes a vector whose length is 
equal to the (radix - 1). Hence for radix 2, c is a one component vector, the 
variable base (see Fig. II. 6p . is used to designate the radix desired. 



c = ww(j_)*z(i_+j_+L) 
z((/ , i_+j_+L /)) = 

(/ z(i_+j_)+c , z(i_+j_)-c /) 



Fig. 1.5. Fragment from Radix 2 butterfly F90 code 



1.4.1 General Radix Issues 

When a vector has length 2", a radix 2 FFT may be used. Whenever the same 
vector is a power of 4, 8, 16, etc. that radix FFT may also be used: i.e.: when 
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c(start) = z(i_+j_) 

ebase (start) = 1 

do k_=start+l,base-l 

c(k_) = WM(j_*k_)*z(i_+j_+(k_*L)) 

z(i_+j_) = z(i_+j_)+c(k_) 

ebase (k_) =EXP( (-2*pi*i) /base) **k_ 

end do 

do k_=start+l ,base-l 
temp = c (start) 
do l_=start+l ,base-l 
temp = temp+c(l_)* 

ebase (modulo ((k_*l_) ,base)) 

end do 

z (i_+ j _+ (k_*L) ) =temp 
end do 



Fig. 1.6. Fragment from Radix n F90 code. 

2" = (2™)' where n = m + l. For example, 2^ = 64 = (2^)2 = 8^ = (2^)3 = 4^. 
Hence, for a vector length 64, radix 2, 4, 8 or 64 may be used. Depending on 
the size of the cache, the number of cache lines, associativity of the cache, etc. 
one radix may perform better on one architecture over another. The reason 
for this is when radix 2 is used on an input vector of length n, the butterfly 
takes two inputs and requires log2n iterations. 

For radix 4, there are four inputs log4n iterations, etc. Consequently, the 
butterfly width and the number of elements becomes the deciding factor in 
performance. Using one executable increases flexibility across machines. 

Incidentally, the implications of an n-way n-radix FFT is that when a 
quantum machine is built, i.e. when all bits can interact with all bits at the 
same time, this algorithm will scale to that machine. 

1.4.2 Experimental Environment 

Our general radix experiments were run on two dedicated systems: 

1. a SUN SPARCserverlOOOE. This machine has two 60Mhz processors, and 
128MB of memory. Its LI cache size is 36KB (20KB Icache and 16KB Dcache) 
and it is one-way set associative. The OS is Solaris 2.7. 

2. a SUN 20. This machine has a single 50 Mhz processor, and 64 MB of memory. 
The OS is Solaris 2.7. 

1.4.3 General Radix Experiments 

The SPARCserver lOOOE and the SUN 20 were both machines dedicated to running 

our experiments. We ran our experiments for a variety of different radices, given the 
following constraints on the input vector length: 2" = (2™) , n = m + l, n,Tn,l € , 
1 < rn < 8. lu determining what results to discuss and represent in our graphs, wo 
removed the following radices: 128, and 256. Their non-competitive performance 
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results is plausibly a consequence of no machines having a cache with 128 or 256 
way associativity. If this changes in the future, our executable could adapt. Table 3 
illustrates performance times given a vector size 16,777,216, using an FFT radix of 
2, 4, 8, 16 and 64. 

1.4.4 Evaluation of Results 

We now turn to a comparison of the general radix algorithm of various platforms as 
illustrated in Figs. [LT] through [TTO] 

Radix 8 performed best on the SUN 20 with a maximum vector size of 16,777,216. 
The performance of radix 2 and 4 experiments were impacted greatly by the number 
of page faults. This is, due to the fact that radix 4 performs twice as many iterations 
as radix 8, and radix 2 performs four times as many as radix 8. Each iteration will 
have about the same number of page faults as the previous, due to the fact that 
every sample is accessed on every iteration. The number of page fauhs recorded for 
these experiments is increased by almost these same factors. 

Radix 8 has twice as many iterations as radix 16, and as expected the number of 
page faults is higher, but radix 8 outperforms radix 16. We conjecture that this is due 
to the large number of cache misses occurring at radix 16 and higher, as can be seen 
by an increase in user time in Table 4. The cache misses are occurring at this higher 
vector size because 16 samples are being processed at the same time rather than 8. 
Either all 16 components don't fit in cache or many are mapping to the same cache 
block, hence they must wait to be loaded. Therefore, although there are fewer page 
faults, the higher number of cache misses are causing radices greater than or equal 
to 16 to incur performance degradation. Notice when comparing the SPARCserver 
lOOOE one processor performance to the SUN 20 for vector sizes < 2^*, the graphs 
are nearly identical. Performance also appears similar from 2^* through 2^^. Since 
the SUN 20's memory was half the size of the SPARCserver lOOOE's we observe 
virtual behavior sooner. More memory is available on the SPARCserver lOOOE even 
though we are using one processor. 

1.4.5 SPARCserver lOOOE's Results (One 60 MHz Processor) 
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9318 


779628 
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5198 


411978 
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35 


9693 


8 


3649 


289973 


35 


43 


8381 


16 


2903 


237271 


26 


52 


8890 


64 


2582 


195693 


13 


66 


16040 



Table 4: Paging Statistics for vector size 16,777,216 as Reported by SPARCworks, 

times are in Seconds 

Notice the steady decrease in the number of page faults as the radix increases, 
this is caused by the number of iterations decreasing by a factor of two as each 
radix increases by a factor of two. As expected the amount of time the application 
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Fig. 1.7. Comparison of radices performance on the SUN SPARCserver lOOOE: One 
Processor 
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Fig. 1.8. Enlarged image of Fig. 11.71 lower left corner. 



spends in page faulting is directly related to the radix. Equally important is the 
steady increase in the percentage of user time as the radix increases, thus implying 
an increase in cache misses. 



1.4.6 Conclusions for the General Radix Approach 

We have succeeded in simplifying a solution to a complex problem: faster and bigger 
FFTs. Reproducible experiments indicate that our designs outperform all tested 
packages in either all, or a majority of the cases, while remaining competitive in the 
others. 
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Fig. 1.9. Comparison of Radices performance on SUN 20. 
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Fig. 1.10. Enlarged image of Fig II. lOl lower left corner. 



Our single, portable, scalable executable of approximately 2,600 bytes (which 
can be easily built in software, hardware, or both) also must be compared with the 
large suite of machine-specific software required by NAG, IMSL, SCSL, ESSL, and 
FFTW. For example, FFTW's necessary libraries, codelets, etc. total approximately 
7,120,000 bytes. 

The FFTW install is potentially complicated. It requires knowledge of makefiles, 
compiler options, and hand-tuning to work around variants of OSs. For example, 
the FFTW install fails with the default Solaris cc compiler, requiring a path change 
to Solaris's SUNWspro cc compiler. The user must know that a new plan is needed 
for every vector size and they need to decide whether to attempt to use wisdom to 
save a plan. 
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The time and investment it takes to create a scientific library also needs to be 
factored into an analysis of the software. New machine designs require the repro- 
gramming of these large libraries in order to maintain their performance. This will 
result in an increase of the library size as new machines are added. Also, the skill 
level of the scientific programmer who supports these libraries, and their knowl- 
edge of each machine is very high. The user of these machine-specific libraries must 
have a deep understanding of the package itself and the system on which it runs. 
Many of these packages require numerous parameters; fifteen for ESSL, and eight 
for SCSL. An incorrect decision for these parameters can result in poor performance 
and even possibly incorrect results. Consequently, the learning period to adapt to 
new software usage can be enormous from a user perspective. 

In comparison, our monolithic algebraic approach to design is based on mecha- 
nizable linear transformations, which begin with a high-level specification, e.g. MAT- 
LAB, Mathematica, Maple, etc. We have found that scientists prefer these high-level 
languages to rapidly prototype their designs 28 , as opposed to interfacing with pro- 
grammers. Consequently, it is imperative that languages such as MATLAB compile 
to efficient code. This would necessitate such languages identifying a functional sub- 
set that embodied MoA and Psi Calculus. 

Our monolithic approach addresses these issues. Our approach has no learning 
curve, a single executable can be used independent of vector size or machine. Ad- 
ditionally, a naive user of fftpsirad2 and fftpsiradn need only know the vector size 
on any platform. Furthermore, our monolithic algebraic approach to design is ex- 
tensible to cache and other levels of the memory hierarchy, as well as parallel and 
distributed computing ,25] . 

We have discovered that radix 2 is not always the best radix to choose for vectors 
whose length is a power of two. This opens other interesting questions: is it better 
to pad to radix 2 or use another radix'lj We believe our research into a general radix 
formulation may yield further optimizations for FFTs. 

Our research shows a systematic way of analyzing memory access patterns and 
subsequent performance of the FFT on one dimensionajf] arrays whose length is 2" . 

Besides determining the constituent components of the sequential FFT, e.g. bit 
reversal or butterfly, the radix used is of paramount importance. By developing 
algorithms for software, e.g., ffipsiradn, we are guaranteed that all instructions, 
loops and variables remain constant during the execution of the program. Hence, 
through monolithic analysis we can in concert, analyze the algorithm, 
the program, and the environment. 



1.5 Effects Due to Specilized Hardware 

In this section we review published work in which the effects due to the use of 
specialized hardware are illustrated |16j . Speciflcally we compare our routine with 
one that was specifically designed to exploit hardware with the capability to carry 
out the operation of multiply and add in one step. We find that our generic radix-2 
FFT (see Fig. II. ip is competitive on such hardware and is superior on a machine 
lacking the specialized hardware. 

* Padding is the usual way of handling vector lengths not equal to 2" 
^ Multi-dimensional FFTs may be built from the one dimensional FFT. 
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The study presented in Ref. [TB] is a benchmark of our radix-2 FFT fftpsirad2 [T5] 
in the context of the plane- wave based electronic structure code Abinit (2^. Such 
codes rely heavily on the use of the FFT and considerable work has been expended 
to optimize performance [5D]. Therefore this context serves as a further stringent 
benchmark test of our routine. The purpose of this work was NOT to claim that 
we have the best FFT. Rather we emphasize that our approach leads to efficient 
code based on general principles without any hardware-specific optimization (which 
naturally can be included as a further refinement). 

We now turn our attention to a study of the one-dimensional FFT. In compar- 
isons between our routine, which we denote as the "CC" routine, we will refer to 
the Abinit routine as the "Ab" [29l[30] . 

We have carried out benchmark performance tests of our 1-dimensional FFT in 
comparison with the routine taken from the Abinit code [291 130| using the same 
platforms as were used in previous tests: the Origin 2000 at NCSA [31] and the IBM 
SP2 at Maui [191 132) . The benefit of our focus on the 1-dimensional transform, lies 
in the fact that it serves as a control test for later developments (multi-dimensional 
arrays and multi-level, multi-processor memories). 

The following tests were carried out. We ran both routines as single processor 
jobs with the -03 compiler option (f90 compiler on NCSA, and xlf on Maui) in 
a dedicated environment. For comparison, therefore, the 3-dimensional transform 
of Ref. l30l was run as a 1-dimensional transform by considering arrays of size 
{N, 2, 2)," (2, iV, 2), and (2, 2, TV) where N = 2" with n = 1, 2, 3, • • ■. Slightly better 
performance was obtained for the {N, 2, 2) case so these are the results that we will 
use for comparisons with the CC routine. Perl scripts were used to compile the jobs 
for each size, time the execution, and collect the results. The timings from the runs 
involving the 3-dimensional routine were divided by a factor of 4 to account for the 
fact that two of the 3 dimensions were held fixed with the value 2 (dimensions of 
length 1 are not permitted in the Ab routine). 

Qualitatively different results were obtained for the same tests carried out on 
the two different platforms emphasizing the important role played by compilers 
and hardware. The Abinit routine was specifically designed to utilize performance 
improvements through the use of the (specialized hardware) Multiply- Add Instruc- 
tion [30]. As mentioned in Ref. [30] this gives the Abinit routine an advantage on 
Maui which is not available on NCSA. This behavior is illustrated in Fig. 11.111 (a) 
in which the Abinit routine clearly is faster and can handle larger sizes on the SP2 
(Maui) in comparison to the Origin (NCSA). We also find the Abinit routine out- 
performs the CC routine (which makes no use of the specialized hardware) for large 
systems on the SP2 at Maui as illustrated in Fig. 11.111 (b) . For small systems, the 
CC routine is quite competitive. 

In contrast, however, the simple and efficient design of the CC routine (see 
Fig. II. 1[) allows it to greatly outperform the Ab routine on the Origin (NCSA) for 
which the Ab routine cannot take advantage of the specialized hardware. Fig. 11.121 
(a) illustrates this behavior. For small systems, both routines have similar slopes 
with a nearly constant offset most likely due to startup costs. For larger systems, 
however, the CC routine clearly wins. 

Lastly by comparing the behavior of the CC routine (which is expected to make 
use of similar hardware on both machines) we can conjecture relationships between 
hardware and performance. Figure 11.121 (b) illustrates the performance of the CC 
routine on the SP2 at Maui vs. the Origin 2000 at NCSA. Apparently the SP2 is 
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Fig. 1.11. Comparison of the Ab routine (Abinit) and the CC routine on the Origin 
(NCSA) vs. the SP2 (Maui), (a) The Ab routine has a performance advantage on 
the SP2 at Maui in comparison to the Origin 2000 at NCSA (b) The Ab routine 
out-performs the CC routine on the SP2 for large systems but for small systems, 
the CC routine is quite competitive. 



faster but has less memory than the Origin. This conclusion follows from the fact 
that the slopes are similar for large systems but the turn-over point (where the slope 
begins to increase) occurs earlier for the SP2. This is an important point: changes in 
performance (i.e. in slope) correspond to various cache, memory, and virtual memory 
boundaries. 

The tests presented in this section are consistent with previous findings [191 1171 
I18| in that our routine, constructed based on Conformal Computing techniques, is 
competitive with other well tuned code despite the absence of any special optimiza- 
tions such as cache loops, or the reliance on a specific piece of hardware, etc. By 
observing the behavior of timing vs. size (changes in slope etc.) we are able to iden- 
tify various details of the hardware (boundaries on cache, memory, virtual memory, 
etc.). Such information can be used for further refinement of the design using the 
general principles of Conformal Computing. 



1.6 Organization of the Monograph 

In this chapter we have introduced considerable background material on our ap- 
proach and given ample material demonstrating competative to superior perfor- 



18 1 Introduction 




Fig. 1.12. Performance of the Conformal Computing routine (CC) on both plat- 
forms and with the Abinit (Ab) routine on the Origin 2000 at NCSA, (a) The CC 
routine outperforms the Ab routine for very large systems and is competitive for 
small systems, (b) Similar results are found for the CC routine on both platforms 
for large systems. Apparently the SP2 is faster but has less memory than the Origin. 



mancc of resulting implementations. The remainder of this monograph is organized 
as follows. 

In the following two chapters, the reader is introduced to the algebraic formalism 
and important similarities and differences with standard linear algebra axe emphar 
sized. The following three chapters presents never-before-published work devoted to 
the development of a cache-optimized FFT. The first of those presents our approach 
in a manner that bridges the gap between the language of standard linear algebra 
and the methods of Conformal Computing. The second of the three, presents the 
same problem in full detail using the machinery of Conformal Computing. The third 
of these expands and illustrates the algorithm emphasizing the natural role played 
by the hypor-cubo data structure. The following chapter builds on this hyper-cube 
view with an application to simulation of quantum computers. The final chapter 
concludes this review with a discussion of important related developments in this 
newly-emerging field along with a discussion of next steps and the proposal of a 
number of Grand Challenges. 
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Conformal Computing Techniques: A 
Mathematics of Arrays (MoA) and the 
i/?-calculus 



In this chapter we introduce the two cornerstones of the Conformal Computing 
approach: (1) A Mathematics of Arrays (MoA) and (2) the ?/j-calculus. MoA is an 
algebra of multi-dimensional monolithic arrays that subsumes the notions of matrix, 
array, tensor, etc., of traditional mathematical and computational approaches. Array 
expressions are manipulated through the use of a collection of operators which are 
defined in this chapter. Expressions are manipulated by linear transformations that 
compose indices using structural information of arrays (i.e. shapes) and operations 
on arrays (e.g. irmer product, outer product, etc.). The consequence of those linear 
transformations is an optimized normal form. This is the essence of the ?/>-calculus. 

The first step is a cartesian normal form: the Denotational Normal Form (DNF) 
which gives the semantics of what to do but not how optimally build the code. From 
the DNF, the process we call V-reduction leads to the Operational Normal Form 
(ONF) which is an explicit recipe for how to build the code. As such it describes all 
loops and iteration structures in terms of starts, stops and strides. As such, the ONF 
contains specific information regarding the algorithm and data layout on a particular 
architecture in terms of memory, processor, and network layout. It can thus be 
directly translated into computer code in any hardware or software language of 
choice. That is, the ONF is a specification given: Iteration, Sequence, and Control 
for each level of processor/memory hierarchy desired. These three FUNDAMENTAL 
issues are the basis of ALL architecures and thus, this is the most abstract way to 
define them. 

This chapter introduces the necessary techniques with some examples. The rest 
of this monograph extends and illustrates the use of these techniciues in the context 
of previously unpublished work devoted to the development of a cache-optimized 
FFT and for an application to the simulation of a quantum computer. 
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2.1.1 Indexing and Shapes 

The ip operator is central to MoA and is used as follows. We write: 



(2.1) 
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to denote the operation in which a vector of n integers, p, is used to select an item 
of the n-dimensional array A. The operation is generahzed to select a partition of 
A, so that if q is a vector having only k < n components then, q_i^A, is an array of 
dimensionality n ~ k and q selects among the possible choices for the first k axes. 
In MoA zero origin indexing is assumed. For example, if A is the 3 by 5 by 4 arrajQ 



12 3 
4 5 6 7 
8 9 10 11 
12 13 14 15 
16 17 18 19 



20 21 22 23 
24 25 26 27 
28 29 30 31 
32 33 34 35 
36 37 38 39 



40 41 42 43 
44 45 46 47 
48 49 50 51 
52 53 54 55 
56 57 58 59 



then 



<l>1pA: 



20 21 22 23 

24 25 26 27 
28 29 30 31 
32 33 34 35 
36 37 38 39 



<21>ipA = < 44 45 46 47 > 



<2 13>^pA = 47 

Most of the common array manipulation operations found in languages like For- 
tran 90, Matlab, ZPL, etc., can be defined from tp and a few elementary vector 
operations. 

We now introduce notation to permit us to define formally and to develop 
the Psi Correspondence Theorem ^3], which is central to the transformation of the 
DNF into the ONF. We will use A, B, ... to denote an array of numbers of any type 
(integer, real, complex, boolean, etc.). An array's dimensionality will be denoted by 
dA and will be assumed to be n if not specified. 

The shape of an array A, denoted by sa, is a vector of integers of length dA, 
each item giving the length of the corresponding axis. The total number of items 
in an array, denoted by tA, is equal to the product of the items of the shape. The 
subscripts will be omitted in contexts where the meaning is obvious. 

A full index is a vector of n integers that describes one position in an n- 
dimensional array. Each item of a full index for A is less than the corresponding 
item of Sa (due to a zero index origin). There are precisely Ia indices for an array. 
A partial index of ^ is a vector of < fc < n integers with each item less than the 
corresponding item of sa. 

We will use a tuple notation (omitting commas) to describe vectors of a fixed 
length. For example, 

< i j k > 

denotes a vector of length three. <> will denote the empty vector which is also 
sometimes written as O. 



^ In all examples, as above, we use consecutive integers as array elements although 
in practice, array elements can be arbitrary integer, real, or complex numbers. 
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For every n-dimensional array A, there is a vector of the items of A, which we 
denote by the corresponding lower case letter, here a. The length of the vector of 
items is tA- A vector is itself a one-dimensional array, whose shape is the one-item 
vector holding the length. Thus, for a, the vector of items of A, the shape of a is 



The precise mapping of A to a is determined by a one-to-one ordering function: 
7 (gamma). Although the choice of ordering is arbitrary, it is essential in the fol- 
lowing that a specific one be assumed. By convention we assume the items of A are 
placed in a according to the lexicographic ordering of the indices of A. This is often 
referred to as row major ordering. Many programming languages lay out the items 
of multidimensional arrays in memory in a contiguous segment using this ordering. 
Fortran uses the ordering corresponding to a transposed array in which the axes 
are reversed, that is, column major. Scalars are introduced as arrays with an empty 
shape vector. This way of viewing scalars (i.e. as empty arrays) is crucial to the 
consistency of the theory and will be discussed more fully in a later section. 

There are two equivalent ways of describing an array A: 

(1) by its shape and the vector of items, i.e. A = {sa, a}, or 

(2) by its shape and a function that defines the value at every index p. 

These two forms have been shown to be formally equivalent . We wish to use the 
second form in defining functions on multidimensional arrays using their Cartesian 
coordinates (indices). The first form is used in describing address manipulations to 
achieve effective computation. 

To complete our notational conventions, we assume that p, q, . . ., will be used to 
denote indices or partial indices and that u, v, . . ., will be used to denote arbitrary 
vectors of integers. In order to describe the ith item of a vector a, either a^ or a[i] 
will be used. If u is a vector of k integers all less than tA, then a[u] will denote the 
vector of length k, whose items are the items of a at positions Uj, j = 0, k — 1. 

Before presenting the formal definition of the tp indexing function we define a 
few functions on vectors: 



Sa = < iA > 



and the number of items or total number of component^ of a is 



ta = tA. 



n + u, u + n 
n X u, u X 71 



k V 



u -ff V 
U + V 



U X V 



k A 



u 



u 



catentation of vectors u and v 

itemwise vector addition assuming t^ — tv 

itemwise vector multiplication 

addition of a scalar to each item of a vector 

multiplication of each item of a vector by a scalar 

the vector of the first n integers starting from 

a scalar which is the product of the components of v 

when fc > the vector of the first k items of u, (called take) 

and when fc < the vector of the last k items of u 

when fc > the vector of t^ — k last items of u, (called drop) 

and when fc < the vector of the first t^ — |fc| items of u 



^ We also use ra, (5a, and pa to denote total number of components, dimensionality 
and shape of a. 
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k 9 u when fc > the vector of (fc y u) -H-(fc A u) 

and when fc < the vector or (fc A u) -H-(fc V u) 

Definition 2.1. Let A be an n-dimensional array and p a vector of integers. If p 
is an index of A, 

pipA = a[7(sA,p)], 

where 

7(sA,p) ~ Xn-i defined by the recurrence 
xo = Po, 

Xj = Xj_i * Sj + pj, j = l,...,n-l. 
Ifp is partial index of length k < n, 

ptl^A = B 

where the shape of B is 

SB = k \J SA, 

and for every index q of B, 

qtpB^ (p4fq)V^ 

The definition uses the second form of specifying an array to define the result of a 
partial index. For the index case, the function 7(3, p) is used to convert an index p 
to an integer giving the location of the corresponding item in the row major order 
list of items of an array of shape s. The recurrence computation for 7 is the one 
used in most compilers for converting an index to a memory address [35) . 

Corollary 2.2. <> i: A = A. 

The following theorem shows that a tf> selection with a partial index can be 
expressed as a composition of tp selections. 

Theorem 2.3. Let A be an n-dimensional array and p a partial index so that p = 
q4fr. Then 

ptpA — rtp{qtpA). 
Proof: The proof is a consequence of the fact that for vectors u, v, w 

(u -ffv) 4f w = u 4f (v -ffw). 

// we extend p to a full index by p 4+-p', then 



p'V(pV^) = (p4fp') V"^ 

= ((q4fr) -H-p')i>A 
= (q4f (r +hp')) 
= (r4fp')V'(qV'^) 
= p' tp{rip{ciipA)) 
p ipA = r ?/) (q ^A) 
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which completes the proof. 

We can now use tp to define other operations on arrays. For example, consider 
definitions of take ( A ) and drop ( y ) for multidimensional arrays. 

Definition 2.4 (take: A ). Let A be an n- dimensional array, and k a non-negative 
integer such that < k < sq. Then 

k A A = B 

where 

Si3 =< fc > 4f (1 V sa) 

and for every index p of B, 

pipB = p^A. 

(In MoA A is also defined for negative integers and is generalized to any vector u 
with its absolute value vector a partial index of A. The details are omitted here.) 

Definition 2.5 (reverse: Let A be an n-dtmenstonal array. Then 

S(<PA) = Sa 

and for every integer i, < i < sq, 

<i> ip {$A) =< So - i - 1 > ipA. 

This definition of <I? does a reversal of the Qth axis of A. 

Note also that all operations are over the 0th axis. The operator fi TS] extends 
operations over all other dimensions. 

2.1.2 Example 

Consider the evaluation of the following expression using the 3 by 5 by 4 array, A, 
introduced in Section [2. 1.11 

< 1 2 > i/.(2 A {<L>A)) (2.2) 
where A is the array given in the previous section. The shape of the result is; 

2 V S{2 A {'PA)) 

= 2v(<2>4f(lV S(<fA))), 
= 2v(<2>4f(lV sa)), 
= 2 V (< 2 > + < 54 >), 
= 2v <254 >, 
= < 4 > . 

The expression can be simplified using the definitions: 

< 1 2 > 7/>(2 A i^A)) 

= < 1 2 > 7/'(*A), 

= < 2 > 7/>(< 1 >'4>{<L>A)), 
= <2>i/)(<3-l-l> ipA), 
= < 1 2 > V^. 

(2.3) 
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This process of simplifying the expression for the item in terms of its Cartesian 
coordinates is caUed Psi Reduction. The operations of MoA have been designed so 
that all expressions can be reduced to a minimal normal form [13| . 
Some MoA operations defined by tp are found in Fig. 12.11 
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Symbol 




Description 




Dimensionality 


Returns the number of diniciisioiis of an array. 


p 


Shape 


Returns a vector of the upper bounds 
or sizes of each dimension in an array. 




Iota 


When n — (scalar), returns a vector containing elements 
0, to — 1. When n — 1 (vector), returns an 
array of indices defined by the shape vector 


tp 


Psi 


The main indexing function of the Psi Calculus 
which defines all operations m MoA. Returns a scalar 
if a full index is provided, a sub-array otherwise. 


rav 


Ravel 


vectorizes a multi-dimensional array based 
on an array's layout {'^row^lcol^lsparse^ •••) 


7 


Gamma 


Translates indices into offsets given a shape. 


/ 

7 


Gamma Inverse 


Translates offsets into indices given a shape. 


s p 5 


Reshape 


Changes the shape vector of an array, possibly affecting 
its dimensionality. Reshape depends on layout (7). 




Pi 


W e turns a scalar and is e equivalent to ^ ^ ^ x 


T 


Tau 


Returns the number of components in an array, (r^ = '7r(p^)) 




Catenate 


Concatenates two arrays over their primary axis. 




Point-wise 
Extension 


A data parallel application of / is performed 
between all elements of the arrays. 




Scalar Extension 


(7 is used with every component of ^r- in the data parallel 

application of /. 


A 


Take 


Returns a sub-array from the beginning or end of an array 
based on its argument being positive or negative. 


V 


Drop 


The inverse of Take 


op red 


R.cducc 


Reduce an array's dimension by one by applying 
op over the primary' axis of an array. 


<f 


Reverse 


Reverses the components of an array. 


e 


R.otate 


Rotates, or shifts cyclically, components of an array. 


to 


Transpose 


Transposes the elements of an array based on 
a given permutation vector 


r2 


Omega 


Applies a unary or binary function to array argument (s) 

given partitioning information. Q is used to perform all operations 

(defined over the primary axis only) over all dimensions. 



Fig. 2.1. Summary of MoA Operations 



2.1.3 Higher Order Operations 

Thus far operation on arrays, such as catenation, rotation, etc., have been per- 
formed over their Qth dimensions. We introduce the higher order binary operation 
Q, which is defined when its left argument is a unary or binary operation and its 
right argument is a vector describing the dimension upon which operations are to be 
performed, or which sub-arrays are used in operations. The dimension upon which 
operations are to be performed is often called the axis of operation. The result of Q 
is a unary or binary operation. 

2.1.4 Definition of Q 

Q is defined whenever its left argument is a unary or binary operation, f or g 
respectively (/ and g include the outcome of higher order operation). f2's right 
argument is a vector, d, such that pd =< 1 > or pd =< 2 > depending on whether 
the operation is unary or binary. Commonly, / (or g) will be an operation which 
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determines the shape of its result based on the shapes of its arguments, not on the 
values of their entries, i.e. for all appropriate arguments is determined by 

and p(£,ig£,r) is determined by p^j and p^r- 

Definition 2.6. /f2ci ts defined when f is a one argument function, d =< a >, with 
a>0. 

For any non-empty array ^, 

fO^C (2.4) 

is defined provided (i) (6^) > a, and provided certain other conditions, stated below, 
are met. Let 

We can write 

Pi 

where z = (— cr) A p^. 

We further require (n) there exists w such that for <* i <* u, 

/(i^C) (2.7) 

is defined and has shape w. The notation <* i <* u, is a shorthand which implies 
that we are comparing two vectors i and u component by component. With this 

p(/J7d)C s u -H-w (2.8) 

and for <* i <* u, 

ii;if{2aO = fiii^O (2-9) 
Note that condition (ii) is easily satisfied for common f's. 

Definition 2.7. We similarly define Q when its function argument is a binary op- 
eration g. gf^d is defined when g is a two argument function, d =< ai ar >, with 
ai > 0, and ar > 0. 

For any non-empty arrays, ^i, and ^r, 

il{,^d)ir (2.10) 

is defined provided (i) {SS,i) > ai and (5^r) > cTr, and provided certain other condi- 
tions, stated below, are met. 

We let [ denote the binary operation minimum and let 

m=mi)-ai)[mr)-ar). 

We require that (ti) ((— m) A {—oi) v pS,i) = {{—'m) ^ {^(^r) V P&)- 
Let 

X = ((-m) A (-ai) V P6) = ((-"0 A i-ar) V P&), 
u = (-m) V i-^i) V pC;> 

V = (-m) V i-O-r) V Pir. 

Note that u =<> or v =<> (both could be empty). We can now write 

pii s u -ff X -H-y, (2.15) 



-0-) V Pi- 

s u-H-z 



(2.5) 
(2.6) 



(2.11) 



(2.12) 
(2.13) 
(2.14) 
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and, 

pCr=v-ffx^z (2.16) 

where y = (—a;) A p^i and z = (— cr^) A p^r- Any of the vectors above could be 
empty. 

We also require (Hi) there exists a fixed vector w such that for <* i <* u, 
<* j <* V, <* k <* X, 



((i ^k)V^Ci)5((j^k)V^C.) 



is defined and has shape w. 
With all this 



(2.17) 



p(6(9^2d)Cr) = u -ff V -ff X -ff w (2.18) 

and for <* i<* u, <* j <* v, <* k <* x, 

(i ^k) V {^l{gf2a)ir) = ((i *k) -0^0 g ((j ^k) ^^r) (2.19) 

Since at least one of u, v is empty, the corresponding one of i, j must also be 
empty. We note the condition {Hi) is easily satisfied for common g's. 

Consider the following example. The operator is defined for scalar and vector 
left arguments and n-dimensional array right arguments. Thus for 9, valid d's are 
< On > and < In >. Let 



12 3 4 
5 6 7 8 



then 



< 21 > e^2<oi>$^ = 



3 4 12 
6 7 8 5 



(2.20) 
(2.21) 



2.2 Contrasting MoA with Linecir Algebra 

As stated previously, one can think of MoA as a generalization and extension of 

standard Linear Algebra. In this section we draw the readers attention to a few 
important differences between MoA and Linear Algebra. 



2.2.1 Scalars as Arrays 

In MoA every object is an array including a scalar. Scalars are considered to 
be zero- dimensional arrays. Often we use the greek letter cr to denote a scalar. The 
shape of a scalar is the empty vector < > as: 

pa =<> . (2.22) 

In general there is an infinite collection of empty arrays. Any multi-dimensional 
array with one or more empty dimensions (i.e. the shape vector contains at least 

one zero element) is called an empty array. More formally, we say that an empty 
array is one for which the product of the elements of the shape vector is zero. That 
is: 

w{pi) = 0. (2.23) 
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Thus for scalars we have: 

p(p(7)=<0>, (2.24) 

that is, a scalar is a zero-dimensional array. In general the shape of the shape gives 
the number of dimensions. 

The notion of a scalar as an array will undoubtedly seem strange to most readers 
and may seem to be an arcane construct, however, the distinction is essential for 
consistency of the theory just as the number is essential to the system of integers 
under the operation of addition or the number 1 is under the operation of multipli- 
cation. It is the analog of the empty set in set theory and the identity operation in 
group theory. Note: there is a difference between a zero dimensional array (a scalar) 
and a one-dimensional array with one element! In the first case the shape is empty 
and in the second the shape is a one-element vector containing the single element 1. 

To illustrate this concept consider the following arrays: cr, < cr > and [a]. The 
first (7, is a scalar or zero-dimensional array, the second < cr > is a one element vector, 
and the third [a] is a 1 x 1 array (square brackets denote two-dimensional arrays as 
is common in Linear Algebra). In traditional Linear Algebra, there is no distinction 
between these three examples. In MoA however, the three arrays are distinguished 
by their shapes: 

pa =< >, (2.25) 
p<(r>=<l>, (2.26) 

and, 

p[a]=<ll>. (2.27) 

Thus a ^<cr>/ [a], because the corresponding dimensionalities, respectively 
given by: 

p(pa)=<0>, (2.28) 
p{p<cT>)=<l>, (2.29) 

and, 

p(p[a])=<2>, (2.30) 

are not equal. 



2.2.2 Need for Empty Arrays 

In the previous section we introduced the notion of scalars as arrays and of empty 
arrays. These subtle distinctions are essential in that our theory is based on shapes. 
In general, the dimensionality of an array changes as operators act on them. As a 
simple example, think of the operator corresponding to the standard inner product 
{dot product). This operator takes two vectors and produces a scalar. The operator 
corresponding to the standard outer product {direct product or cartesian product) 
takes two vectors and produces a matrix. Another example exists in the concept of 
a functional. A functional takes a function (which can be thought of as a vector in 
an infinite dimensional space) and returns a scalar. 

In MoA this concept is completely general. One can imagine a sequence of opera- 
tions that convert an n-dimensional array into a m dimensional array (for m smaller 
or larger than n) . If such a sequence of operations acts to reduce the dimensionality 
of the result with each step, the natural stopping point (i.e. the boundary condition) 
is the scalar (i.e. a zero-dimensional array). 
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2.2.3 Graphical Representation 



Any multi-dimensional array can be represented graphically using the vector angle 
brackets (< and >) and the square brackets ([ and ]). In sect ion [2. 1.1 1 we represented 
a three-dimensional array as a series of two-dimensional arrays next on one another. 
We often find it convenient to represent arrays by nesting two-dimensional arrays. 
We illustrate this for the hyper-cube, below. An n-dimensional hyper-cube is an n- 
dimensional array in which the length of each dimension is 2. For a two-by-two array 
we write: 



(22) 



1 
2 3 



(2.31) 



This is an example of a two-dimensional hyper-cube, and a four-dimensional hyper- 
cube would be written as: 



(2222) 





1 




4 5 






2 3 




6 7 






8 9 




"l2 13' 






10 11 




14 15 





(2.32) 



2.2.4 Notational Subtleties 

It is essential to be always aware of the shape of the array in order to avoid notational 
confusion. For example, the array 

v=<a6>, (2.33) 

is a one-dimensional array (i.e. a vector) with two elements, while the array 

i^[ahl (2.34) 

is a two-dimensional array (i.e. it is a 1 x 2 array) with two elements. The difference 
is determined by their shapes. Explicitly we have: 



pv =<2>, 



(2.35) 



and, 



pe=<12>. (2.36) 

As discussed in previous sections,, we use an index vector and the tp operator in 
order to select elements of the arrays. Thus: 



<0> V'v = a, 



(2.37) 



and 



<l>Vv = 6, (2.38) 

for the one-dimensional representation, and for the two-dimensional representation 
we have: 

<0 0>7/'C = a, (2.39) 

and, 

<0 1>?/'C = b. (2.40) 
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Consider now the important difference between MoA and standard Linear Alge- 
bra. The concept of a row vector exists in MoA as in standard Linear Algebra: 



(2.41) 



In contrast to Linear Algebra, however, there is no concept of a column vector. To 
access the elements of what would normally be called a column vector we use the 
higher-order O operator (see the appendices for the definition of this operator). 
Note also the following inequality, 



a b 




<ab> 


c d 




<cd> 



<ab> 
<cd> 



a b 
c d 



(2.42) 



2.2.5 Addition and Multiplication of Arrays: Comparing and 
Contrasting with Linear Algebra 

The following operation on two arrays of shape < 2 2 > (i.e. 2x2 matrices) is 
identical in MoA and standard Linear Algebra: 



a b 
c d 



+ 



(a + e) (b + f) 
(c + g) (d + h) 



(2.43) 



Subtraction of two arrays is defined in a similar way. In both cases we find elements 
of the two arrays are combined in a point-wise fashion. 

With matrix multiplication, however, we find an important distinction. In MoA, 
multiplication, like addition and subtraction, occurs also in point-wise fashion: 



a b 
c d 



(a X e) (6 X /) 
(c X g) {d X h) 



Similar definitions exist for all scalar operations (e.g. -)-,—, x, /). 
The operation corresponding to standard matrix multiplication: 

{a X e + h X g) {a x f + b x h) 
{c X e + d X g) {c X f + d X h) 



a b 








c d 




9 h 





(2.44) 



(2.45) 



in MoA corresponds to the following sequence of operations. First we form the 
following two matrices: 

(a X e) (ax /) 
(c x e) (c X /) 



A ■ 



(2.46) 



and, 



B = 



(2.47) 



[bxg) (bxh) 
[d X g) (d X h) 

The matrices A and B are then added to produce the result of Eq. 12.451 

The matrices A and B can be seen to be constructed as outer products of the 
vectors < a c > and < e / > to form A and <b d> and < g h> to form B. Thus 
by considering the notions of matrix addition and matrix multiplication in standard 
linear algebra we are naturally led to the MoA operations: (1) point-wise extension 
of scalar operation and (2) outer product. These constructs are made precise in the 
following two definitions. 
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Definition 2.8 (Point- wise extension of scalar operations). Point-wise ex- 
tension of a binary operation "op" between two non-empty arrays and ^2, such 
that p^i = p^2, has shape: 

p{iiop^2)=pii, (2.48) 
and for valid indices <* i <* p^i, is given by: 

iV-feop^) = (iVCi)op(iV6)- (2.49) 
Examples of valid binary operations, op, include include +, — , X, /, etc. 

Definition 2.9 (Outer product). The outer product, •op of two arrays and f,. 
has shape: 

p{0»op^r) = {p^l)Mp^r) (2.50) 

and for valid indices <* i <* p^;, and <* j <* p(,r, is given by: 

(i -H-j) (Ci 'op ^r) = (i iP^i) op (j ,p^r). (2.51) 



Thus we see that matrix multiplication of standard Linear Algebra is a special 
case of Def. 12.81 with op = +, and the standard tensor product is a special case of 
Def.l^with op = X. 
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A Cache-Optimized Fast Fourier Transform: 
Part I 



3.1 Chapter Summary 

The material in this chapter is taken from a (larger) paper that is to appear in the 
Journal of Computational Physics. 

Our subject in this and the following two chapters is the design and a Fast 
Fourier Transform algorithm designed to optimize data locality in the cache. The 
algorithm is presented and discussed using traditional concepts familiar to scientists 
and engineers. In this chapter new concepts based on Conformal Computing tech- 
niques are introduced gradually and illustrated in context. The following chapter, 
serves as a stand-alone tutorial on Conformal Computing techniques that are de- 
veloped and illustrated in the context of the new FFT algorithm. We find favorable 
performance of the new algorithm without any machine-specific optimizations. In 
particular we find the new routine to be a factor of 2 to 4 times faster than our 
previous design that often outperformed well-tested library routines such as ESSL, 
IMSL, FFTW, or NAG (see Chap. [1] and references therein). 

The results presented in these chapters are promising for further developments 
in terms of optimizations over processor/memory hierarchies because the algorithm 
can be generalized to arbitrary partitioning over any number of levels of the pro- 
cessor/memory hierarchy. More importantly, this research illustrates the power of a 
uniform, mechanical, mathematically based design strategy that leads to portable, 
scalable, and verifiable software or hardware. 



3.2 Introduction 

Our new Fast Fourier Transform algorithm represents a significant application of a 
mathematically rigorous, systematic, design protocol that the authors have named 
Conformal Computing. The vision of Conformal Computing is to algebraically con- 
nect the hardware and software through linear transformations from high-level spec- 
ifications of the problem to the low- level instruction sets of the underlying hardware. 
In the early days of computing, in which programs were written directly in assem- 
bler, this vision was more easily realized on single-processor, single-memory systems. 
Today, however, the situation is considerably more complex in that there are many 
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levels of software, and processor- memory separating the high-level problem specifi- 
cation and the hardware. 

The Fast Fourier Transform (FFT) is one of the most important computational 
algorithms and its use is pervasive in science and engineering. The research presented 
in this chapter represents a significant improvement to the FFT algorithm resulting 
in a factor of four speed-up for some of the largest systems tested in comparison with 
our previous records. Our previously published work indicates that our FFT is com- 
petitive with or outperforms standard library routines without any machine-specific 
optimizations (see Chap. [1] and Ref. |19)). This success is achieved through optimiz- 
ing in-cache operations. In the traditional FFT, data access becomes progressively 
remote (leading to cache misses and page faults) as the algorithm proceeds. In our 
new approach data is periodically rearranged so as to maximize data locality. 

Our algorithm can be seen to be a generalization of similar work aimed at out- 
of-core optimizations [21]. Similarly, block decompositions of matrices (in general) 
are special cases of our reshape-transpose design. Most importantly, our designs 
are general for any partition size, i.e. not necessary blocked in squares, and any 
number of dimensions. Furthermore, our designs use linear transformations from an 
algebraic specification and thus they are verified. Thus, by specifying designs (such 
as Cormen's and others) using Conformal Computing techniques, these designs too 
could be verified. 

A general algebraic framework for Fourier and related transforms, including 
their discrete versions, is discussed in [32]. As discussed in [37] [HS] and using this 
framework, many algorithms for the FFT can be viewed in terms of computing 
tensor product decompositions of the matrix Bl, discussed below (see Fig. 13. 
Subsequently, a number of additional algorithms for the FFT and related problems 
have been developed centered around the use of tensor product decompositions |39l 
|40l|4T]|42]|43j|44]. The work done under the acronym FFTW is based on a compiler 
that generates efficient sequential FFT code that is adapted to a target architecture 
and specified problem size 45, 46] 1471 1481 149j . A variety of techniques have been 
used to construct efficient paraUel algorithms for the FFT [50l [5T] [52] [53] [54l [55] . 
Other important FFT implementations are discussed in [3D] and [56) . 

The purpose of this paper IS NOT to attempt any serious analysis of the num- 
ber of cache misses incurred by the algorithm in the spirit of of Hong and Kung 
and others [251 [13| [21] . Rather, we present an algebraic method that achieves (or 
is competitive) with such optimizations mechanically. Through linear transfor- 
mations we produce a normal form, the ONF, that is directly implementable in 
any hardware or software language and is realized in any of the processor/memory 
levels ^5]. Most importantly, our designs are completely general in that through 
dimension lifting we can produce any number of levels in the processor/memory 
hierarchy. 

One objection to our approach is that one might incur an unacceptable perfor- 
mance cost due to the periodic rearrangement of the data. This will not, however, 
be the case if we pre-fetch data before it is needed. The necessity to pre-fetch data 
also exists in other similar cache-optimized schemes. Our algorithm does what the 
compiler community calls Uhng. Since we have analyzed the loop structures, access 
patterns, and speeds of the processor/memory levels, pre-fetching becomes a deter- 
ministic cost function that can easily be combined with reshape-transpose or tiling 
operations. 
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Again we make no attempt to optimize the algorithm for any particular archi- 
tecture. We provide a general algorithm in the form of an Operational Normal Form 
that allows the user to specify the blocking size at run time. This ONF therefore en- 
ables the individual user to choose the blocking size that gives the best performance 
for any individual machine. 

We now begin our discussion of the new algorithm with a discussion of previous 
efforts applied to index optimizations of the traditional FFT. 



3.3 Index Optimizations for the Traditional FFT 
3.3.1 Traditional FFT Algorithm 

We begin by reviewing the traditional FFT and its recent refinements using the 
■i/j-calculus. Some of the following discussion is excerpted from Ref. [19j . 

We began with Van Loan's [53 high-level MATLA^ program for the radix 2 
FFT, shown in Fig. 13.11 This program denotes a single loop program, with high level 
array/vector operations and reshaping. 



Input: X in and n — 2*, where t > is an integer. 
Output: The FFT of x. 

X Pji X 

for q — 1 to t 
begin 
L ^ 2^ 

r *— n/L 

end 



Here, P„ is a n X n permutation matrix, Bl 

is a diagonal matrix with values 1,ujl, . . . .ui^ 
L'th root of unity. 



(1) 

(2) 
(3) 
(4) 
(5) 
(6) 
(7) 



L/2, and Ol, 



along the diagonal, where ujl is the 



Fig. 3.1. High-level program for the radix 2 FFT. 



In Line 1 of Fig. 13.11 P„ is a permutation matrix that performs the bit-reversal 
permutation on the n elements of vector x. In Line 6, the n element array x is 
regarded as being reshaped to be a L x r matrix consisting of r columns, each of 
which is a vector of L elements. Line 6 can be viewed as treating each column 
of this matrix as a pair of vectors, each with L/2 elements, and doing a butterfly 
computation that combines the two vectors in each pair to produce a vector with L 
elements. 

The reshaping of the data matrix x in Line 6 is column-wise, so that each 
time Line 6 is executed, each pair of adjacent columns of the preceding matrix are 
concatenated to produce a column of the new matrix. The butterfiy computation, 
corresponding to multiplication of the data matrix x by the weight matrix Bl, 



^ MATLAB is commonly used in the scientific community as a high-level prototyp- 
ing language 
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combines each pair of L/2 element column vectors from the old matrix into a new 
L element vector of values for each new column. 

Let us now interpret the algorithm of Fig. 13.11 using traditional concepts of 
matrix algebra. The algorithm of Fig. 13.11 is equivalent to an iterative sequence of 
operations in which an n x n matrix is multiplied by an n-dimensional data vector. 
The number of such operations is given hy t = log2(n) and the matrix is different 
for each iteration as is discussed below. 

The first vector a;^^' is given by 

^(1) ^jVf(^)x(°), (3.1) 

where is the initial bit-reversed data vector The first-iterate matrix 

is an n X n block-diagonal matrix consisting of n/2, 2 x 2 matrices B2 where 
the matrix Bl is defined in Fig. 13.11 The next step of the algorithm produces the 
second iterate vector x^'^^ 

^(2)^^(2)^(1)^ (3.2) 

from the first-iterate vector x^^^ through multiplication by the second-iterate matrix 
M^'^\ The matrix M'^-* is a n x n block-diagonal matrix consisting of n/4, 4x4 
matrices B4, along the diagonal. This process continues in a straightforward fashion. 
The final step of the process is given by 

^(t)^^(t)^(t-i)^ (3.3) 
where the t-ih iterate vector a;'*' is the Fourier Transform of the original data vector: 

x'-'^ = FFT{x), (3.4) 

and the matrix M'*' = B„. 

3.3.2 Index Optimization Using the ■0-CalcuIus 

The discussion of the previous subsection is useful for the purposes of illustrating 
the basics of the FFT but is inefficient. The essence of the developments of Ref. 
is the removal of all temporary arrays through the use of Conformal Computing 
techniques. The concise algorithm illustrated in Fig. 13. H is the starting point for the 
developments of Ref. [19] . Essential to this analysis is the notion of partitioning and 
reshaping the data to maximize efficiency by taking advantage of the sparseness of 
the matrices M^'' (for l<q<t). 

The final result for the radix-2 FFT is presented in Fig. 13.21 (reproduced from 
Fig. II. H and Ref. [19]). As was demonstrated in Ref. [19] and reproduced in Chap.[T] 
this implementation is competitive with or outperforms a variety of standard library 
routines. Such high performance is a consequence of the fact that, through the 
optimal use of matrix/vector indexing, no temporary arrays are used. 

3.3.3 Optimizing Array Access Patterns 

The key result of the present paper is a generalization of the algorithm of Fig. I3.2l in 
which performance is increased (factor of two to four speedup for moderate to large 
FFT's) through repeated restructuring of the data so as to minimize cache misses 
and page faults. 
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do q = 1 , t 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = 0,L/2-l 

c = weight (row) *x (col +row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 

Fig. 3.2. Final index-optimized radix-2 FFT with in-place butterfly computation. 
This implementation eliminates the need for temporary arrays through the optimized 
use of array indexing. 

Note the array access patterns implied by the code fragment of Fig. 13.21 In 
particular, as the outer loop variable increases (i.e. g = 1, 2, ■ • •) the stride of 
the data access (i.e. the difference between the index of x(col'+row) and that of 
x(col'+row+L/2) ) is doubling with each increment of q. For sufficiently small values 
of q, both x(col'+row) and x(col'+row+L/2) reside in cache and access is fast. How- 
ever, as q increases, at some point L/2 is larger than the cache size (first LI cache and 
subsequently L2 cache, etc) and accessing both x(col'+row) and x(col'+row+L/2) 
results in a cache miss. For large FFT's (i.e. those that don't fit into cache, main 
memory, paged memory, etc.) the performance continues to deteriorate with increas- 
ing q as the increasing separation of x(col'+row) and x(col'+row+L/2) requires ac- 
cessing higher levels of the memory hierarchy (main memory, paged memory, etc) 
leading to page faults etc. This problem, resulting from data non-locality is common 
to all previous implementations of the FFT. 

The essence of the new cache-optimized algorithm, set forth in this paper, is 
the notion that periodic restructuring of the data array x (i.e. actually moving the 
data around to achieve locality) is less costly than the penalty which is otherwise 
incurred as a result of cache misses and page faults. 

3.4 Cache-Optimized FFT: Key Elements of the New 
Approach 

3.4.1 Restructuring the Data: the Reshape- Transpose Operation 

The key data-restructuring operation used in the new algorithm is called the reshape- 
transpose operation. In order to understand this operation we must first consider the 
data vector x as a one- dimensional array of length n. Next we consider reshaping 
the array into a collection of vectors of length c so as to form a two-dimensional 
array of dimension r x c where the number of rows is given by r = n/c. Using the 
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i/)-calculus notation we write a;'"-* =< r c > p :k. The length of each row c is chosen 
arbitrarily, and is specified as a parameter to the algorithm. In practice, however, 
we find that optimal performance is generally obtained for c less than or equal to 
the cache size. 

At this point an example would be helpful. Consider, for simplicity, the initial 
data vector to be a vector of length n = 32 consisting of sequential integers starting 
from zero. In i/)-calculus notation we write x — t(32). Next we choose c — 4 and 
reshape the one- dimensional array x to be an 8 x 4 array a;^"-* by filling in the entries 
in lexical order (i.e. in row-major order as is done for arrays in C++). Using the 
-i/i-calculus notation: a;'"-* =< 8 4 > p t(32). Explicitly we have: 



Now consider the FFT access patterns of the array a;'"' in light of the algorithm 
of Fig. 13.21 In Fig. 13.21 elements x(col'+row) and x(col'+row+L/2) are accessed 
and combined pairwise in the following order: (1) first x(0) and x(l) are combined, 
followed by x(2) and x(3), followed by x(4) and x(5), etc. In the next step the 
stride doubles leading to the following combinations: (2) elements x(0) and x(2) are 
combined, followed by x(l) and x(3), next x(4) and x(6), next x(5) and x(7) etc. 

Now consider the next step of algorithm of Fie:. l3.2l in the context of the re-shaped 
array of Eq. 13.51 The next step of the algorithm, (3) combines elements x(0) and 
x(4), x(l) and x(5), x(2) and x(6), etc. If, in Eq. 13.51 the row length c corresponds 
to the cache size, the process of combining x(0) and x(4) leads to a cache miss. 
Likewise combining x(l) and x(5) also leads to a cache miss etc. In fact, all further 
operations lead to cache misses, page faults etc. 

At this point, in order to avoid cache misses, we re-structure the data array using 
the reshape-transpose as discussed in the following. In effect we wish to re-order the 
array by sequentially taking the elements of the columns and placing them into the 
rows in lexical order. The first row of the reshape-transposed array would therefore 
consist of the elements 0, 4, 8, 12. The next row would consist of the elements 16, 
20, 24, and 28. Proceeding to the next column of a;'"^ leads to the following elements 
in the third row: 1, 5, 9, 13, etc. 

The operation is called reshape-transpose because we can think of the process of 
occurring in two stages. In the first stage we transpose the array. In standard matrix 
language we would write a;'*"' = Tx^°'\ The corresponding operation (transpose) in 
■i/'-calculus notation is expressed by the symbol ©. Thus we write: 



X 



(a) _ 



< 8 4 > p(t(32)) 



"0123- 
4 5 6 7 
8 9 10 11 
12 13 14 15 
16 17 18 19 
20 21 22 23 
24 25 26 27 

.28 29 30 31. 



(3.5) 



X 



(b) _ 



®(< 8 4 > p (t32)) 



4 8 12 16 20 24 28 

1 5 9 13 17 21 25 29 

2 6 10 14 18 22 26 30 

3 7 11 15 19 23 27 31 



(3.6) 



In the second stage of the reshape-transpose operation we reshape the result of the 
transpose operation to give: 
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a;'"' =< 8 4 > p (©(< 8 4 > p (t32))) = 



Note, in practice, the reshape-transpose operation which transforms the array of 
Eq. l3.5l into that of Eq. l3.7l is carried out in a single step using T/i-calculus indexing 
techniques. The two-step process indicated by Ea. l3.6l (for the transpose) and Eg. 13.71 
(for the subsequent reshaping) was given merely for the purpose of illustration. 

Now, by working with the data as restructured according to Eq. 13.71 we find 
that the data needed for step (3) of the algorithm of Fig. 13.21 (i.e. elements x(0) 
and x(4), elements x(8) and x(12) etc.), are now close to one another thus reducing 
cache misses. We thus continue to process the data by combining only elements that 
are within a given row. The combinations for the first row, are therefore, x(0) and 
x(4), x(8) and x(12), x(0) and x(8), and lastly x(4) and x(12). The next step would 
be to carry out similar operation for the remaining rows. 

The last step of the FFT (for this example) requires the combination of elements 
x(0) and x(16), x(l) and x(17), etc., which would give rise to cache misses using 
the data as structured in Fig. 13.71 Therefore we, once again, restructure the data by 
carrying out another reshape-transpose operation to yield: 



X^"^ =<8 4> p (®sf'=')EE 



Now the last operations of the FFT (for this example) involve only elements within 
a given row. These operations combine x(0) and x(16), x(l) and x(17), x(2) and 
x(18), x(3) and x(19) etc. 

Once the process of computing the FFT is completed, there remains the final step 
of putting all of the data back in the correct order. One way to accomplish this would 
be to undo the multiple reshape-transpose operations. Using ?/;-calcuIus indexing 
techniques, it is possible to put all of the data back in the proper order 
in one step as is discussed in section 14.4.21 

3.4.2 Re-Ordering of "Butterfly" Operations in the 
Cache-Optimized FFT 

In addition to changing the access patterns so as to achieve data locality, we are 
also changing the order in which the various operations of the FFT are carried out. 
Figure [3]3] illustrates pictorially the first few operations of the FFT before any data 
restructuring has taken place. The operation of, say, combining elements x(0) and 
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1 5 9 13 

17 21 25 29 

2 6 10 14 

18 22 26 30 

3 7 11 15 

19 23 27 31 



(3.7) 
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x(l) to give the new values for x(0) and a;(l), in place (i.e. without the need for 
temporary arrays) is called the "butterfly" because of the way it is traditionally 
drawn (with vertical and diagonal lines) as in Fig. 13.31 Even at this stage (before 
any data restructuring occurs) we carry out the butterfly operations in non-standard 
order. 



12 3 



8 9 10 11 




etc. 



Fig. 3.3. Schematic illustration of some of the access patterns for first two cycles 
of the FFT (some patterns omitted for clarity). 



In the traditional FFT, all butterfly operations involving the smallest strides 
are carried out first before moving to larger strides (e.g. all butterfly operations 
involving nearest-neighbors such as a;(0) and x{V) are carried out before moving to 
next-nearest-neighbors such as x{Q) and x{2) etc.). In the new algorithm, however, 
we want to maximize in-cache operations so we carry out all operations with the 
first set of data that fits in cache before moving on. This is illustrated in Fig. 13.31 as 
follows. 

In keeping with the example of the previous subsection we are setting c = 4 
(nominally the cache size for this example). The vertical bars separating groups of 
four numbers (e.g. |0 1 2 3|; |4 5 6 7|, etc) schematically indicate cache boundaries. 
Thus we perform all operations of the FFT (with sequentially increasing strides) 
within a given data vector of length = c until the point at which the stride equals 
c. At that point we move the next group and repeat the process until all groups of 
data have been exhausted. To proceed to the next step (where the stride = c) would 
lead to cache misses for the data structured as in Fig. 13.31 At this point, therefore, 
we carry out the reshape-transpose operation to restructure the data. 

In the next two cycles of the FFT we work with the re-structured data so as 
to achieve data locality as illustrated in Fig. 13.41 Note that the access patterns are 
identical to those of Fig. 13.31 and that all operations which can be carried out with 
a given set of data elements (i.e. those which fit within a vector of length c) are 
performed before moving on to the next group. 

At this point one can see the general pattern emerging. (1) The index-optimized 
radix-2 FFT of Fig. 13.21 is carried out within a given data block of length c albeit 
with modified weights (to be discussed in the next subsection). (2) Then an outer 
loop cycles over the data blocks. Following that, (3) the data is re-arranged by 
carrying out a reshape-transpose until the last re-arrangement is achieved. (4) At 
this point, one must decide how many cycles of the FFT need to be done after the 
last reshape-transpose as, in general, the number of cycles after the last reshape- 
transpose might be less than log2(c). (5) Lastly the original ordering of the data is 
restored (as discussed in section H. 4. 2p . 
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4 8 12 




16 20 24 28 1 5 9 13 




etc. 



Fig. 3.4. Schematic illustration of some of the access patterns for next two cycles 
of the FFT after the transpose/reshape (some patterns omitted for clarity). 



In the last paragraph, in step (4), we mentioned that after the final reshape- 
transpose the number of FFT cycles within a given data block of length — c might 
be fewer than for all the other stages of the calculation. This can be clearly seen 
in the present example. As illustrated in Fig. 13.51 in the last stage of the present 
example, only cycles of length 2 are needed (as opposed to cycles of 2 and 4 for 
previous orderings). 



16 1 17 



2 18 3 19 



4 20 5 21 



etc. 



Fig. 3.5. Schematic illustration of some of the access patterns for the final stage 
of the FFT in which only cycles of length 2 are required. 



3.4.3 The Number of Reshape- Transpose Operations 



The total number of reshape-transpose operations and the number of FFT steps 
between reshape-transpose operations is determined by the two parameters c and 
total length of the input data array n. The illustrations of the butterfly operations 
presented in Figs. 13.31 13.41 and 13.51 illustrate the notion of the stride of the data 
elements being combined. The stride is simply the number of memory locations 
separating a given two elements. From the definition of the block matrices Bl (see 
Fig. 13. ip we easily identify the stride to be equal to L/2. Since the stride doubles 
with each iteration of the FFT, we find that a stride of length c/2 (the maximum 
stride which stays within the vector of length c) is reached after log2{c) steps of the 
calculation. Thus there are log2{c) FFT steps between subsequent reshape-transpose 
operations. 

The total number of iterations of the FFT is log2{n), therefore the total number 
reshape-transpose operations is the integer part of the ratio log2{n)/log2{c). 
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3.4.4 Re-Ordering of the Weights 

The book-keeping for the weights required for the new algorithm is presented in 
this subsection. We wish to implement the new algorithm as a generalization of 
those illustrated in Figs. 13.11 and 13.21 In order to do that we must keep track of 
which weights go with which elements of the data array as it is re-arranged via the 
sequence of reshape-transposes. 

At this point it is convenient to return to the notation of subsection 13.3.11 in 
which we illustrated a given iteration of the FFT as the multiplication of a block- 
diagonal n X n matrix by an n-dimensional vector. As discussed in the previous 
subsection, we carry out log2{c) iterations of the FFT (the last of which having a 
stride — c/2) and then re-arrange the data with a reshape-transpose operation. Then 
we carry out another log2{c) steps before re-arranging, etc. Thus the last iteration 
before the first reshape-transpose is written as: 

^(iog2{c)) _ j^,^(ioS2{c))^{!o92(c/2)) g-j 

Thus the first operation requiring the use of the restructured data is written as: 

^(log2(2c)) _ j^^j{io92(2c))^(!o92{c)) -j^g-j 

Let us write the reshape-transpose operation as a linear transformation defined by 
the matrix 5* acting on the data vector 

yW ^ g^(log2{c)) ^ (3.11) 

where we have introduced the notation i/'"' to indicate the data as re-ordered by the 
reshape-transpose operation. Next we transform Eq. I3.1UI by multiplying both side 
of the equation, on the left, by the matrix S and introducing the identity / = S~^S 
between the matrix m'^'"^^'^^''^^ and the vector a;('°»2(c)) gj^g. 

which, in turn we write as: 

yW=iV«j/(°). (3.13) 
In equation 13. 131 we have introduced the re-ordered weight matrix: 

AT*') = 5'M''°''^(^"»S"\ (3.14) 

which defines the first iterate of a sequence of log2 (c) transformations involving the 
re-structured data. We have also introduced the first-iterate data vector y^-^K 

This iterative procedure continues in an analogous fashion to that outlined in 
subsection IXTT] using the matrices A^*^) = ^^^('"s^Ccj+j)^-! i < j < log2{c). 
The last step in this sequence of operations (i.e. before the next reshape-transpose 
is carried out) is given by: 

y{log2ifi)) _ pj-{lo92(<:))y(log2(c/2)) ^g j^g-j 

where iv('°92(e)) = 5A/{'°92{c^))5.-i ^ 

At this point we introduce the next restructuring of the data and transform the 
next step of the FFT as: 

(^y(l-09l(2c)) _ gpj-{log2(2c)) i^y(log2(c)) ^g ^g-j 

and the procedure continues in an obvious fashion. 
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3.4.5 ■0-Reduction 

As emphasized in subsection 13 . 3 . 1 1 we never actually materialize (instantiate) any 
n X n matrices. The discussion of subsection l3.3.1l and of the previous subsection is 
merely for the purpose of constructing the derivation. Ultimately we use the tech- 
niques of the t/j-calculus, in particular the technique called "i/)-reduction" to arrive 
at an implementation analogous to that of Fig. 13.21 The process of i/)-reduction is a 
procedure which reduces an algebraic expression to its simplest form by eliminating 
temporary arrays. In effect, all array access operations are effected through direct 
indexing. The application of this procedure to the general-radix FFT is discussed 
in Ref. [19] and the result for the special case of the radix-2 FFT is presented in 
Fig.EJ 

In order to effect the ^-reduction for the present problem we must consider the 
structure of the transformed weight matrices, such as A'''-'' = SM^''"^^^''^^-'^ S^^ (for 
1 < j < log2{c)) in greater detail as presented in the next subsection. 

3.4.6 Structure of the Weight Matrices 

In this subsection, we consider the structure of the weight matrices M'*' and the 
corresponding transformed weight matrices A^'-*' in some detail. As discussed in 
subsection 13.3.11 the matrix M'-'' is a block-diagonal n x n matrix, consisting of 
n/2', 2* X 2*-dimensional blocks -62*. The structure of the block matrices Bl is 
given in Fig. 13.11 Note, each of the block matrices Bl is sparse, having only 2L 
non-zero elements. The remaining — 2L elements are zero. Note, by using ip- 
calculus techniques, we only carry out multiplication operations involving 
non-zero elements. 

The effect of the reshape-transpose operation on the weight matrix M''-* is to 
rearrange each of its sparse diagonal blocks into a series of smaller diagonal blocks. 
Thus through the repeated use of the reshape-transpose operation the 
weight matrix remains banded with a fixed maximum width dictated by 
the parameter c. 

3.4.7 Transforming the Weights 

The 2L non-zero elements in a given block Bl map a specific set of L elements of the 
data array into the corresponding L elements of the updated array. There is thus a 
unique 2 1 homomorphism of the 2L nonzero elements of the array Bl and the L 
elements of the data vector. The 2L non-zero elements of a block Bl are contained 
within four L/2-dimensional sub-blocks (see Fig. I3.1|l . Two of these blocks are the 
L/2-dimensional identity matrix II, (in Fig. l3.1I Lj. = L/2) . The other two blocks 
consist of the L/2- dimensional blocks and —Ql,,- 

The transformation of the weight matrix A^^^' = (i°92(c)-Hj)5.-i ^^^^ ^g^gj._ 
mined by constructing a unique 1^1 isomorphism between the non-zero elements 
of m('°'=2('=)+^) and the elements of the data vector 3;{'°92{2c)+j-i) gj^^^ ^^^^ 
how to transform the data vector y'-*-* = 5'j;('o92(2c)+j-i) ^gjjjg ^jjg reshape-transpose, 
the isomorphism allows us to determine the transformation of the weight matrix. 

The isomorphism is very simple. We begin by constructing a vector ^ of length 
n as follows. The first L/2 elements are unity, taken from the /^^ identity matrix of 
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the first block Bl- The next L/2 elements are the non-zero elements (taken in lexical 
order) of the matrix also from the first block Bl- The next L/2 elements are 
unity taken from the II^ second block Bl, and the next L/2 elements are the non- 
zero elements of Ol^ also from the second block Bl ■ This procedure continues until L 
elements have been extracted from each block Bl to yield the n- dimensional weight 
vector ^. This procedure is completely general and applies to all transformations 
of the weight vector. Before the first reshape-transpose all of the blocks Bl are 
identical. However, as we will see, for all other weight matrices generated by the 
reshape-transpose operation, the sub-blocks Bl will in general be different. 

The vector ^ transforms in the same way as the data vector ^' = S(, to give a 
transformed weight vector. Thus we only need to keep track of the changes to the 
weight vector ^ under the influence of the reshape-transpose operation. We follow the 
procedure to transform ^ exactly as was done to transform x. That is we re-shape it 
into a two-dimensional vector < r c> and observe the re-ordering which occurs 
due to the reshape-transform operation. 

That the isomorphism between ^ and x is correct can easily be seen by the 
following argument. In carrying out one butterfiy operation, we update the data 
vector j: — > x' two elements at a time. For the log2{Ij)-th FFT step, the stride is 
L/2 and therefore elements x{i) and x{i-\-L/2) are combined with the corresponding 
weights in the ^ array as follows 

x{i) = ^(i) * x{i) + i{i + L/2) * x{i + L/2) (3.17) 

and 

x{i ^ L/2) = iii) * xii) - i{i + L/2) * x(i -\- L/2) (3.18) 

Since each element of ^ gets multiplied by the corresponding element of x the two 
vectors must transform together. Again we emphasize that the multiplications by 
unity implied in the first terms of Eqs. 13.171 and 13.181 is only an intermediate step 
in the derivation. In the final implementation there is no multiplication by unity or 
zero. 



3.5 Cache-Optimized FFT Illustrated 
3.5.1 Specific Examples and Patterns 

Consider now the transformed equation A*''^-' as defined by Eq. 13.141 The matrix 
to be transformed, M^'°^^^'^'^^^ is a block-diagonal n x n matrix with n/(2c), blocks 
-B(2c)- Each matrix -B{2c), is thus comprised of two c-dimensional identity matrices 
Ic and two c-dimensional matrices He and —ilc- As indicated in Fig. 13.11 Oc is a 
diagonal matrix with elements (i7c)ii = 0J2c (i-e. the (2c)-th root of unity raised to 
the i-th power) for < i < (c — 1). Thus the elements of the weight vector ^ consist 
of: (1) c entries equal to unity, (2) c, (2c)-th roots of unity: lu2c, ^2c ■ ■ • '^2c ^'i (3) 
c entries equal to unity, etc. Now upon reshaping the weight vector we obtain: 
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< r c > ; 
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1 • 


■ 1 
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• '^2c 


1) 
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1 • 


■ 1 








^2c • 


• ^^20 


1) 



(3.19) 



Returning to our example of the n = 32 array, explicitly, we have: 



< 8 4 > ; 



1 1 1 1 

12 3 
UJg UJg iJg 

1111 

12 3 
UJg Ulg Ulg [jjg 

1111 

(jJg (jJg Uj'g UJg 

1111 

12 3 
,UJS (jJg UJg UJg 



(3.20) 



Now reorder with a reshape-transpose to give 



< 8 4 > p(© < 8 4 > pC) = 



IujI 
leg 

IujI, 
IujI 
lij'i 

loj'i 
iwi 



ojI 

col 

2 

LJg 

uj'i 



la;|. 



(3.21) 



Now by reversing the process that was used to construct the weight vector ^ from 
the elements M^''"^^^'^'^^^ (which is M'^^ in this example) we obtain from the weight 
vector ^ the weight matrix A'^''^^: 
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3.5.2 Structure of Reshape- Transposed Weight Matrix 

Note carefully the structure of A*'''^' given in Eq. 13.221 The equation is now block 
diagonal with 2x2 blocks along the diagonal. Thus the access patterns for the 
matrix multiplication 

yW=iV«y(°\ (3.23) 
using the reshape-transposed quantities are the same as those for 

3.(1) ^j^,f(i)^{0)^ (3.24) 

Now, however, we must realize that the structure of A'^'^-' is more general than that 
of M(i' 

Previously, in the definition of the untransformed matrices M^'-* , a block diago- 
nal sub-matrix Bl was completely specified by its dimension L. Now however, the 
definition of the sub-matrices must be generalized. 

We see in Eq. 13.221 that the first four blocks on the diagonal are equal and the 
weight element tjg = 1 is the same as that we previously found in the definition of 
the sub-matrix Q\ = uj2 ~ 1 (see Fig. I3.1|l . Now, however, there are four different 
2x2 matrices. They are distinguished by their weight elements ujg, uj\, ujg, uig, 
respectively. 

Now we see that the weight matrix, requires three labels: (1) L to determine the 
particular root of unity L (L = 8 in this example) , (2) a giving the power of the root, 
(0, 1, 2, and 3 in this example), and (3) d (= 2 in this example) to determine the 
dimensionality (this in turn, is related to the number of reshape-transpose operations 
which have occurred). We use the symbol i?^"^' ''^ , which is a d/2-dimensional matrix. 
Likewise we generalize the definition of the block sub-matrices Bl P^^' ^'^^ 
example: 
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l-d/2 



(<T, d) 
(<T, d) 



(3.25) 



So far we have only encountered the 1-dimensional (d/2 = 1) weight matrix fi^^' 
The generalization to higher dimensions (which depends on how many times the 
reshape-transpose operation has occurred) will be presented shortly. 

We now summarize the discussion of A^^^-* for the n = 32 case. The block diagonal 
matrix to be transformed, M^^\ consists of four identical 8-dimensional blocks Bg. 
We write: 

M'^') =58 0^8 ©58 0^8, (3.26) 

The transformed matrix, on the other hand, transforms into four groups of four 
different 2x2 matrices: 

^{1) ^ ^(0,2)^^(0,2)^^(0,2)^^(0,2) 



@ Pi'''' 



0^(1,2)^^(1,2)^^(1,2) 

0/3f ''^ 0/3f 0/3^''' 
0/3f ''^ 0/3f ''^ 0/3^''^ 



(3.27) 



We now state the result for the next iterate of the weight matrix A*''^-' = SM^'^^ , 
and then illustrate it in detail. The matrix A''^'^-' is a n x n sparse matrix {n = 32) 
consisting of four groups of two, 4x4 matrices as: 



(0, 4) 



(3.28) 



where the four-dimensional weight sub-blocks are given by 





g(<r, 4) 
Pl6 



TO Lule 

1 

1 -a;f6 
1 







-<5+^ 



(3.29) 



Note that the sequence of matrices p[g' for < a < (c — 1) (c = 4 for this 
example) in Eq. 13.281 contain all of the weights: uji^, lo 
uilg, originally present in the untransformed matrix M''*' 



1 

16 1 



3.5.3 Derivation of AT^^^ From the Weight Vector 

We now explicitly illustrate the construction of the weight matrix A'''^'^^ from the 
corresponding weight vector f . As was discussed for the situation illustrated by 
Eqs. 13.191 and 13.201 we form the weight vector ^ from the elements of M'*-* and 
reshape to give: 
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< 8 4 > 



^16 ^16 ^16 



12 3 

cJi6 oJie 

, ,4 , ,5 , ,6 , ,7 



Next we carry out the reshape-transpose to give: 



< 8 4 > p(® < 8 4 > pC) 



1 1 oj^e 

1 1 UJW 

1 1 ^le 
1 1 ujle 
1 1 ujfe 

1 1 ^^?6 
1 lujff, 

.1 1^?6 



4 
4 

^^16 



Wl6 

7 
7 



(3.30) 



(3.31) 



By reversing the steps that led to the formation of ^ (in Eq. l3.30(l from M^'*' , Eg. 13.311 
expands to give Eq. 13.281 Each row of Eq. 13.311 translates into the corresponding 
4x4 sub-block matrices of Eq. 13:29] 



3.5.4 Last Step 

The last step in this example (n = 32) corresponds to the original weight matrix 
M(5): 

M'"^^ = B32. (3.32) 
Carrying out one reshape-transpose gives: 

(3.33) 

which is a block-diagonal matrix consisting of 8 x 8 blocks. One can see that Eg. 13.331 
is the analog of Eq. 13.261 In order to achieve data locality (i.e. blocks of dimension 
less than or equal to c = 4) we must reshape-transpose once more to yield: 

(3.34) 

which is a block-diagonal matrix with sixteen different 2x2 sub-blocks on the 
diagonal. 



3.6 The General Algorithm 
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3.6 The General Algorithm 

The general pattern for the multiply-restructured weight matrices was discovered, 
by continuing the analysis of the past few sections with larger and more general 
data structures, and is presented in this section. 



3.6.1 Numbering 

There are log2{c) operations before the first reshape-transpose corresponding to 
strides of length = v/2 = 2^~^ , where we allow 1 < p < log2{c). For iteration p 
the block-diagonal sub-matrices of M^*"^ have dimension v x v. 

After the first reshape-transpose the block-diagonal sub-matrices of array 
for 1 < p < loQ'zic) arc of dimension = v/2 = 2''^^ as before, however there are 
now different types of weight matrices which must be specified, in addition to their 
dimension, by a parameter a. 

At this point, the length of the FFT cycle is L = 2^c which also denotes the 
root of unity (e.g. lul, and powers thereof). Because the number of distinct roots of 
unity (i.e. L/2 = 2^''~^-'c ) is greater than can fit in a single vector of length I < c/2 
there are different types of block sub-matrices for a given dimension. There will also 
be copies. 

We find it convenient to specify the cycle length L with two parameters p and 
I where we define L — 2^c' whore for Z = 0, the variable cycles through the values 
1 < P < log2(c). After the first reshape-transpose, 1=1 and we write L = 2^c for 
1 < P < log2{c), after the second reshape-transpose we have / = 2, and L = 2^c^, 
etc. Thus the variable / counts the number of reshape-transpose operations. 



3.6.2 General Weight Sub-Matrices 

After / reshape-transpose operations, the weight matrix is most-conveniently speci- 
fied by including I as an additional parameter. We designate the weight matrix with 
/owr parameters: a (the type of matrix), d its dimensionality, L the root of unity, and 
I the number of reshape-transpose operations which have been carried out. Explicitly 
we have, 



(L, I) 



id/2 

h/2 



(L, 
(L, I) 



(3.35) 



where the j-th component of the (diagonal) matrix i?^' ^ is given by 



(<T + jc') 



(3.36) 



where < j < (d/2 — 1), and the number of different types of matrices is indicated 
by < < (c' - 1). 



3.6.3 Number of Different Weight Matrices 

The number of different weight matrices is determined as follows. We specify the 
total data array length by n = 2^'"""=c''"''°=. With this factorization we have by 
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definition 1 < Pmax < log2(c) We use two parameters to specify the cycle length 

L — 2*'c' whore I counts the number of transposes and 1 < p < log2{c) (indices p and 
/ cycle as inner and outer loops respectively). For all < i < Imax — 1 the variable 
p cycles as 1 < p < log2{c). For I = Imax, we have the restriction 1 < p < Pmax 
Next determine the number of different cycles: 

= - = — =c'. (3.37) 

This is to be interpreted as the number of different weight matrices of dimension 

V X V. For cycle lengths L < c wc find / — and dv = 1 meaning that for this case 
(corresponding to the traditional FFT without reshaping) each matrix is uniquely 
determined by its dimensionality. 

Note also that the number different weight matrices jumps by a factor of c 
for each reshape-transpose operation. For example, after the first reshape-transpose 
operation, we have c different 2x2 matrices corresponding to factoring the cycle 
length L = 2c. For the next FFT iteration we have L = 4c corresponding to p = 2, 
which becomes factored as c different 4x4 matrices etc. After the next reshape- 
transpose operation there are different weight matrices of a given size vxv where 
as before v = 2'' and cycle length L = 2^(? . 

3.6.4 Number of Copies of a Given Cycle Length 

The number of copies of a given cycle length L is given as the ratio of the total data 
vector length n = 2J'"»"»= c'"""" to the cycle length L = vdy-. 

- ^ - ■ ^^■^^> 

To shed some light on this quantity, consider the situation we encounter on the last 
step of the FFT in which L = n. In this 1 and there is only one copy 

of each weight matrix . For example if pmax = 1 then we have the total data 

vector length n = 2c''"''"' and we have each of the c''"""' different 2x2 matrices 
represented only once. If on the other hand we have Pmax = 2 we have n = 40'"*"* 
and each of the c'™"* different 4x4 matrices appears only once etc. 

Next consider the situation in which L = n/2. In this CSiSG Sy 2 and there will 

be two cycles and each matrix /3|^' ^ appears twice. 

The algorithm can be thought of as a generalized FFT algorithm which processes 
vectors of length c. The weight matrix thus naturally partitions into c x c blocks. 
Thus we must consider the total rmmber of copies of a given c x c block. This is 
computed as follows. The length of a given set of copies is vSv = 2*'""'^ c'-'"'""'"'^ 
which is partitioned into blocks of length c is obtained by dividing vSv by c to give: 

= 2^'— 0^''"'"'=-'-'^ (3.39) 



3.7 Cache-Optimized FFT Using the t/j-Calculus 

At this point we turn to a brief look at two key steps in the algorithm: the reshape- 
transpose and the final transpose. These two examples serve as vehicles to illustrate 
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the use of Conformal Computing techniques. The discussion here presents only the 
key results. The complete details of the derivations are presented in the following 
chapter (part II). 

3.7.1 Reshape Transpose Operation 

Recall the expression to abstract our view of the cache. We write: 

<r c> p (®(< r c> p (m))) (3.40) 

to represent the reshape-transpose operation on our restructured data array. Reading 
from right to left we start with the array of indices (tn) which is a one-dimensional 
array (vector) of n sequential integers starting with and ending with n — 1. In the 
present example we only concern ourselves with the manipulation of this index array. 
The first reshaping of (tn) is indicated in Eg. I4.1l bv the expression < r c > p (tn) 
which implies a r x c array consisting of the entries of (tn) taken in sequential (i.e. 
row-major) order. For this example we require the reshaped array to contain the 
same number of elements as the original array: r x c — n. Using the notation of 
the if) — calculus we write tt <r c> = n, where the operator tt acting on a vector 
produces a scalar equal to the product of the elements of the vector. 

In the next step in Eq. 14.11 we apply the transpose operator © to produce the 
c X r array ©(< r c > p ('-w)). In the last step we re-partition the array with 
the < r c > p to produce the r x c array obtained by taking the elements of 
®(< r c > p (tn)) sequentially in lexical order (i.e. row-major). 

3.7.2 -i/j-Reduction of the Reshape- Transpose Operation 

Using Conformal Computing techniques, an Operational Normal Form (ONF) is 
obtained from Eq. 14.11 as is derived in detail in Part II. For the purposes of this 
paper we merely indicate the final result. The ONF is an algebraic specification, in 
terms of indices (i.e. starts, stops and strides), that indicates explicitly how a given 
expression is to be built (in software or hardware). For simplicity we define: 

A = {<rc> p (m)). (3.41) 

Using this notation we express the ONF of Eq. 14. li as: 

y i,j s.t. 0<* <ij> <* <rc> (3.42) 

<ij> tp{< rO'p (®A)) = (rav Q) A)[j + (c x i)]. (3.43) 

The last step that expresses the resuh directly in terms of the elements 
of the array A (as opposed to ©A) is discussed in detail in the next chapter 
(Part II). The result is simple and has been directly translated into C++ code 
as indicated in Fig. 15.31 
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void trans_rslip( complex *datvec, complex *temp,int mnax, 
int csize) 

{ 

int iind , j ind , kind , kmax , jmax , imax , index , c2size ; 

int rows,arg; 

int max(int a, int b) ; 

// The routine carries out the transpose -re shape operation 

index=0 ; 

rows = nmax/csize; 
imax - rows-1; 

c2size = int(pow(csize,2.0)) ; 
jmax = max(0,nmax/c2size-l) ; 



f or ( j ind=0 ; j ind<= jmax ; j ind++) 

{ 

f or(iind=0; iind<=imax; iind++) 

{ 

temp [index] = datvec[jind+csize*iind] ; 
index += 1; 

> 

} 

f or (iind=0; iind<=nmax-l ; iind++) 
{ 

datvecCiind] = temp [iind]; 

} 

} 

Fig. 3.6. C++ code fragment implementing the reshape-transpose in terms of 
index manipulations. 

3.7.3 Reordering the Data 

After the last step of the FFT the data will not be in the correct order 

and wc must do something to return it to its initial order (i.e. prior to any 
reshape-transpose operations). The simplest approach would be to rearrange 
the data by applying a series of inverse reshape-transpose operations. There 
is, however, a far more efficient approach in which no data needs 
to be moved. In other words, we use Conformal Computing techniques to 
determine the index vector which will select the correct components of the 
array. Wc present only the final result in this paper. Complete details of the 
derivation and an in-depth discussion (with examples) is presented in the 
following chapter (Part II.). 

The input data vector y of length n is carried into the vector x through 
the d = log2{n) steps of the FFT. In the process, Imax reshape-transpose 
operations have been carried out. The resulting vector x is thus not in the 
correct order (as a result of the multiple reshape-transpose operations) and 
must therefore be rearranged into its final form ^. Wc now obtain £,hyper as 



^hyper = {{-Cr) (td)) Q) {{<d> p 2) p x) 



(3.44) 



3.8 Results and Discussion 
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where a = Imax log2 (c) and d — log2 (n) and is the rotate operator that 
induces a cychc permutation of the vector L{d) (as wiU be discussed in detail 
in Part II) . The operators acting on the vector x, in Eq. 14.591 represent the 
composite inverse operation of the series of reshape-transpose operations that 
occurred during the FFT. In the final step, we reshape ^hyper into a one- 
dimensional array: 

I =<n> p ^hyper = FFT{y). (3.45) 

The ONF is now expressed as follows. Define two new indices t and s with 
limits given by: 

0<t< t* = (7r((<d - a>)p 2)), (3.46) 

and, 

< s < s* = (7r(<cr> p 2)), (3.47) 
we define a new two-dimensional array by reshaping the vector ^ as: 

^(2) =<s*t*> p ^ (3.48) 

The final result is then written: 

<st> =x[s*2'^-'"+ <i>]. (3.49) 

This expression was directly translated into C+-f code as illustrated in 

Fig. mi 



3.8 Results and Discussion 

The performance results for our new algorithm are presented in Figs. 15.51 
and 15.61 These experiments were run in a single-processor, dedicated, non- 
shared environment on the IBM SP2 machine "squall" at the Maui High- 
Performance Supercomputer center [58] . Specifications for the machine are 
quoted in the caption to Fig. 15.51 

In the first figure (Fig. 15. 5p we plot the time vs. input data length. There 
are two curves, one for our new cache-optimized FFT and one for a similar 
run with no cache optimization. Direct comparisons are possible since both 
curves are produced by the same code. For the non-cache-optimized run, we 
simply chose the blocking size c (specified as a parameter at run time) to 
be greater than or equal to the length of the data vector n. We see that the 
curves have essentially the same shape but the cache-optimized one is shifted 
to the right by one power of two compared to the non-optimized one. Thus 
for a given run time, we can legitimately claim a factor of two speed-up. 

The results presented in Fig. 15.61 emphasize the improved performance for 
a fixed data size by taking the ratio of the run time for the non-optimized 
run to the cache-optimized one. We see that for the some of the largest sizes 
considered, a factor on the order of 4 speedup is achieved. 
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void final_trans (complex *datvec , complex *temp,int nmax, 
int lcap,int csize) 

{ 

/ * dvar = # of 2's in the hypercube 

* leap = # of transpose-reshapes (T-rho) that have to be undone 

*/ 

lege = logCfloat (csize) ) /log(2 . 0) ; 

sig = lcap*logc; 

smax = pow(f loat (2) , sig) ; 

dvar = log(f loat (nmax) ) /log(2 . 0) ; 
tmax = pow(2.0,dvar-sig) ; 

index = 0; 

f or (tind=0 ; tind<=tmax-l ; tind++) 
{ 

f or (sind=0 ; sind<=smax-l ; sind++) 
{ 

temp [index] = datvec [sind*tmax + tind] ; 
index += 1 ; 

} 

} 

f or (index=0 ; index<=nmax-l ; index++) 
-C 

datvec [index] = temp [index]; 

} 

} 

Fig. 3.7. 0+4- code fragment implementing the final transpose bringing the data 
back into the correct order. 

Figure [531 is also enlightening in that it highlights the various levels of the 
memory hierarchy. A change in slope of run-time vs. size indicates the crossing 
of a boundary between one level of the memory hierarchy and another. For 
example from the results of Fig. I5.6l we can make the following estimates. For 
roughly 2^ < iV < 2^ the speed is most likely dominated by the speed of the 
registers. For 2^ < < 2^ the speed is dominated by LI and L2 cache and 
for 2^ < iV < 2^^ the speed corresponds to main memory. For TV > 2^^ paged 
memory (of size AkB) dominates. The large jumps in performance (i.e. factors 
of 4 for the largest sizes) correspond to the presence or absence of page faults. 

In general, the performance of the algorithm is a tradeoff between the 
increased speed obtained by having more data in the cache (with increasing 
c) and the cost of actually moving the data around. One might naively guess 
that the best performance would be obtained by choosing c to be equal to the 
cache size. However, we find, the best performance by choosing blocking sizes 
c given by small powers of 2. In other words, it is more economical to move 
data around many times within the cache than it is to move large blocks into 
and out of cache due to the extreme speed of the cache. That is, direct access 
and movement of components within the cache has no overhead. 
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Fig. 3.8. Comparison of the cache-optimized FFT with our previous i/)-designed 
FFT [T5] indicating reproducible enhanced performance. The data in this figure 
represent the raw timing data obtained by running the experiments in a single- 
processor, dedicated, non-shared environment on the Maui SP2 machine "squall" 
(one of 2, 375Mhz Nighthawk-2, IBM SP2 nodes). Reproducibility was demonstrated 
by comparison of the results of five separate runs which produced nearly identical 
results (not shown) . The slope of the curve reveals the speed of various levels of the 
memory hierarchy as amplified in Fig. 15.61 



3.9 Conclusions 



We have presented a new algorithm for the Fast Fourier Transform that is a 
factor of 2 to 4 times faster than our previous records (that were competitive 
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with or outperformed well-tested library routines as shown in Chap. [T] and 
Ref . [12] ) . This success was achieved through the use of Conformal Computing 
techniques to devise a generalized partitioning scheme leading to optimized 
cache access. The principle is very simple. Data is periodically re-arranged 
so as to always achieve locality in cache. This approach is in contrast to the 
traditional FFT in which data access becomes progressively remote (leading 
to cache misses and page faults) as the algorithm proceeds. The key concept 
in the new algorithm is the repeated use of the reshape-transpose operation 
to move data from remote locations into the cache as needed. A given one- 
dimensional data structure for the input vector is initially reshaped into a 
two-dimensional array of dimension r x c where c is an arbitrary blocking size. 
The blocking size c is completely arbitrary and is specified as a parameter at 
run time. Based on various cost functions (cache speed, cost of moving data, 
etc.) we can predict the performance of the algorithm vs. the blocking size c. 
We find the best performance for blocking sizes c given by small powers of 
2. The results presented in this paper are promising for further developments 
in terms of optimizations of multi-dimensional FFT's over cache in single as 
well as multi-processing environments. 
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Fig. 3.9. Performance enhancement of the cache-optimized FFT as compared with 
our previous ■(/'-designed FFT showing a factor of four enhancement for some of the 
largest sizes. The data plotted in this figure is the ratio of the time for the non- 
optimized routine divided by that for the optimized routine. The changing slope of 
the optimization curve reveals various levels of the memory hierarchy. For roughly 
2^ < N <2*' the speed is most likely dominated by the speed of the registers. For 
2^ < N <2^ the speed is dominated by LI and L2 cache and for 2^ < < 2^^ 
the speed corresponds to main memory. For N > 2^^ paged memory (of size AkB) 
dominates. The jumps in performance correspond to the presence or absence of page 
faults. 
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A Cache-Optimized Fast Fourier Transform: 
Part II 



4.1 Chapter Summary 

This chapter explores in detail two key steps of a new cache-optimized Fast 
Fourier Transform algorithm that was presented in the previous chapter (see 
Chap. [3]). Through the use of Conformal Computing techniques, discussed 
herein, an impressive performance improvement (factors of 2 to 4 speed-up) 
was obtained. The present chapter serves as a tutorial introduction to the 
techniques of Conformal Computing: a systematic design methodology for 
hardware/software algorithms based on a rigorous mathematical theory. Two 
key aspects of the new algorithm, the reshape-transpose and final-transpose 
are explored in detail, and serve as vehicles to introduce and illustrate many 
key aspects of the theory. Although our approach is based on a rigorous for- 
mal theory, the net result is always an efficient specification, the Operational 
Normal Form (ONF) for how to build a particular algorithm in software or 
hardware in any programming language. As we explicitly demonstrate, the 
ONF for each of the operations {reshape-transpose and the final-transpose) 
are directly translated into computer code. 

4.2 Introduction 

This chapter continues the discussion of the cache-optimized FFT presented 
in the previous chapter (Part I: see Chap. [3]) and develops the techniques of 
Conformal Computing in some detail. Over the past decade, these techniques 
have been successfully applied to a number of algorithms that are ubiquitous 
across science and engineering disciplines, such as the Fast Fourier Transform 
(FFT) [16l[l9l[lIl[T8l[59], LU decomposition [60], matrix multiplication, Time 
Domain convolution, QR decomposition [20l[6T], etc. That is to say, these and 
other algorithms were first expressed algebraically using MoA then tp-Teduced. 
These designs were realized in both hardware and software [62l [63l [64l [65l [66l 
[611 [681169]. 
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At this point we turn to a detailed look at two key steps in the new cache- 
optimized FFT algorithm: the reshape-transpose and the final transpose. 

4.3 Reshape Transpose 
4.3.1 Algebraic Specification 

Recall the expression to abstract our view of the cache. We write: 



to represent the reshape-transpose operation on our restructured data array. 
Reading from right to left we start with the array of indices (tn) which is a 
one-dimensional array (vector) of n sequential integers starting with and 
ending with n — 1. In the present example we only concern ourselves with 
the manipulation of this index array. The first reshaping of (tn) is indicated 
in Eq. 14.11 by the expression < r c > p (m) which implies an r x c array 
consisting of the entries of (m) taken in sequential (i.e. row-major) order. For 
this example we require the reshaped array to contain the same number of 
elements as the original array: r x c — n. Using the notation of the '0-calculus 
we write n <r c> = n, where the operator tt acting on a vector produces a 
scalar equal to the product of the elements of the vector. 

In the next step in Eg. I4.1l we apply the transpose operator to produce 
the c X r array (^{<r c> 'p {in)). In the last step we re-partition the array 
with the < r c > p to produce the rxc array obtained by taking the elements 
of (J)(< r c > 'p {in)) sequentially in lexical order (i.e. row-major) . 

4.3.2 ■0-Reduction: Denotational Normal Form and Operational 
Normal Form 

As we will see in the following, all of the operations in Eq. 14.11 are composed 
to yield one final expression through the use of direct indexing. The process 
of converting an expression such as that in Eq. 14.11 into one involving only 
indexing operations is called ip- reduction. The first step is to produce the 
Denotational Normal Form (DNF) which reveals the semantic meaning of a 
reduced array expression such as Eq. 14. li in terms of Cartesian coordinates. 
For example, given an array A, we write: 



where we've introduced the ip operator which takes an index vector <i j> and 
extracts the i,j-th element of the array A. On the right hand side of Eq. 14.21 
we use a common bracket notation to denote a component of an array. For 
an expression involving a number of operations, such as the one in Eq. 14. 1[ 
the DNF is obtained by composing indices using the tp operator and an index 



<r c> p (©(< r c> p (m))) 



(4.1) 



<ij>ijA = A[i,j], 



(4.2) 
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vector by applying the definitions of the various operations (p , ©, etc.) 
as win be demonstrated shortly. In essence, we view each operator ( p , ©, 
etc.) as effecting a certain re-arrangement of the index vector. Obviously such 
re-arrangements can be performed sequentially to find the re-arrangement 
corresponding to the composite operation. The resulting expression, involving 
only the starting data array (i.e. the array (tn) in this example) and Cartesian 
coordinates is, by definition, the DNF which is independent of layout. 

The final step of the ip- reduction process results in the Operational Nor- 
mal Form (ONF) which describes the underlying access patterns of the array 
operation in terms of the specific data layou10 on a computer. Explicitly, for 
the simple example of Eq. 14. 2i we write: 

<ij> VA= (ravA)[7^o^(< i j >; (p^))], (4-3) 

if the layout is row major, and 

<ij>iljA={ rav A)[7co/.„„„(< i j >; (pA))], (4.4) 

if the layout is column major. Equations 14.31 and 14.41 are examples of ONF's 
corresponding to the DNF of Eq. 14.21 By necessity, the ONF is dependent on 
the underlying data layout as are the operations reshape p and ravel (rav). 

In Eqs. 14.31 and 14.41 the operation rav (which we call ravel) produces 
the one-dimensional vector ( rav A) from the elements of the array A. The 
ordering of the elements of (rav A) depends on the layout [row-major vs. 
column- maj or) . For an r x c array A in row-major order, the first c elements 
of (rav A) are the elements of the first row, the next c elements of (rav A) 
are taken from the elements of the next row etc. In a similar way, if A is an 
r X c array, in column-major order, the first r elements of (rav A) are taken 
from the first column of A, the next r from the second column of A, etc. 

In Eqs. 14.31 and 14.41 we have also introduced the layout functions, 
7coZiimn(< i j >', (pA)), and jrowi< i j (pA)) which produce scalar indices 
in order to extract a single element from the one-dimensional array ( rav A) . 
In general there will be a family of layout functions for various situations. 
Equations 14.31 and 14.41 are each, therefore, of the form 

< i j > ipA = {ra.v A)[index], (4.5) 

where index is a scalar, and we have again used the bracket notation (as is 
done in C + +) on the right hand side of Eqs. 14.31 14.41 and 14.51 to denote the 
use of a scalar index to extract a component of a one-dimensional array. 

The indexing functions jrow and jcoiumn take two arguments: (1) the index 
vector < i j > and (2) the shape vector (pA) . The shape vector (pA) is a vector 
consisting of the lengths of the various dimensions of the array. For the present 
example, given an r x c array A, we have: 



^ In this context, layout means row major, column major, regular sparse, etc. 
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{pA) ^<r c> (4.6) 

The function 7 was given in Definition 12. II and in all further discussions 
we will assume row-major ordering and drop the row subscript: jrow — > 7 

4.3.3 Prom the ONF to the Generic Design 

The ONF describes how to build the code independent of a particular pro- 
gramming language. At this point all loop nests are revealed, and all data 
flow and memory management arc indicated. Each loop nest indicates not 
only access patterns but levels of the processor / memory hierarchy. We use V 
(for all) as our generic loop indicator. Similarly, bracket notation indicates 
"address of" so that when indices are calculated they are calculated relative 
to the address of the indicated array (i.e. the address of (pointer to) the first 
element). All of this in conjunction with 7 gives us all essential information 
to build the design in any programming language at both the hardware and 
software levels. For example, V i,j s.t. < i < 2, 0<j<4 

(ravA)[7(< ij>; < 2 4 >] = (ravy4)[j + (4 x i)] (4.7) 

denotes a generic form with two loops with @A + j + (A x i) in the body. 
At this point a mechanization to any language is possible. Similarly, loops 
indicating message passing, shared memory access can easily be instantiated 
by whatever libraries support that loop level. 

4.3.4 i/j-Reduction of the Reshape- Transpose Operation 

We now return to the problem of i/'-reducing the reshape-transpose operation 
of Eq. 14.11 To illustrate this example we choose n = 32, r = 8 and c = 4, as 
was done in earlier discussions. 

Step I: Determine Shape 

The first step in the -ip- reduction process is the determination of the shape 
vector which, for this example, is given by: 

(pA) =< r c>=< 8 4 > . (4.8) 

This step is crucial in order to enforce the use of valid index vectors. In order 
for an index vector < i j > to be valid, its components must not exceed the 
lengths of the corresponding dimensions of the shape vector. Specifically we 
say that a valid index vector satisfies: 

0<* <ij> <* <rc>, (4.9) 

where the notation <* and <* implies component- wise comparison of the two 
vectors < i j > and < r c >. 
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Step II: Perform Psi-Reduction and Reduce to Normal Form(s) 

We begin by taking Eq. 14.11 apart using the V' operator and an index vector 
<i j>. To simplify the notation we define the quantity within the innermost 
set of parentheses in Eq. 14. li as A = <r c> p (m). By applying the definition 
for reshape: V i,j s.t. <* <ij> <* <r c>, we get: 

<i j > ip{< r c> p (©A)) = ( rav (©^))[7(< i j > ; < r c >) mod (7r(<r c>))] 

= (rav(©A))[7(< i j > ; < r c>)] 
= (rav(©yl))[j + (ixc)] (4.10) 

In Eq. 14. 10) we are applying the definition of the reshape operator to the object 
rav (©A) which is defined in terms of the i j > ; < r c >) function. In the 
first line of Eq. 14.101 the expression: mod(7r(<r c>)), is part of the definition 
which handles the case in which one wants to reshape a smaller array into a 
larger array from the elements of the smaller array (repeated appropriately). 
In this case, however, the total number of components in the reshaped array is 
the same, allowing us to drop the expression, mod(7r(<r c>)), in the second 
line of Eq. 14.101 In the third line of Eq. 14.101 we have inserted the explicit 
expression for 7 from its definition (see Definition I2.ip . 

Next we wish to get rid of the transposed array (^A in favor of the original 
array A. We consider selecting an element of using an index < i' j' > as 
follows: 

V s.t. 0<* <i' j'> <* <cr> (4.11) 

<z'/>^(©A) = (rav(©A))[7(<z'/>;<cr>)] = (rav(©A))[/ + z'*r]. 

(4.12) 

In order for this expression to agree with Eq. 14.101 we must equate the argu- 
ments in square brackets in Eqs. 14.101 and HTT^ as: 

j' + i' * r = j + i * c. (4-13) 

From this wc find the the primed indices in terms of the unprimed indices 
(assuming r > c for this example) as: 

= int(c * i/r), (4-14) 

and, 

j' = {j + i*c) mod r. (4-15) 
Next we use the definition of transpose to simplify Eq. 14.121 

<i'j'>VX©-4) =<j'i'>^ljA^ {ra.vA)[i'+j'*c]. (4.16) 

Thus by using Eqs. 14.141 and l4T5l in Eq. l4.16l we finally express Eq. [iTOj com- 
pletely in terms of A. Considerable further simplification is possible, however, 
as is discussed in the next section. 
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Step III: further Simplification 

The formulation just presented, while formally correct, is computationally 
inefficient. In particular, we wish to avoid the operations int and mod in 
Eqs. 14.141 and 14.151 Instead, if we need to work with both the primed 
and un-primed indices we can use the loop structure (expressed in C+-|- 
syntax assuming r > c for this example) given in Fig. 14.11 

i = 0; 

ratio = r/c; 

f or (iprime=0 ; iprime < c; iprime++) 
{ 

for(k=0; k < ratio; k++) 
i 

for(j=0; j < c; 

jprime = j + k*c; 

y 

i = i + 1; 

y 

y 

Fig. 4.1. Efficient loop structure to produce both sets of indices <i j> and < i' j' > 
assuming r > c. A similar loop structure is easily constructed for the case c> r. 

In our applications, further simpUfication occurs. Note, in Eg. 14.101 

a single element is extracted from the transposed array ^A. In this case, as in 
many situations, we can compute the entire 0^ and select elements from itH 
This is most easily computed by looping over the values of the index < i' j' > 
in the rightmost expression in Eq. 14.161 
Thus, to construct the array 

{<rc> p (4.17) 

for example, we only need the elements of the corresponding one-dimensional 
array: 

rav(< r c> p (®A)) = rav QA, (4.18) 

constructed from Eq. 14.161 

This computation was implemented in C++ as illustrated in Fig. 14.21 Note 
in Fig. l4.2l the two for- loops correspond to the bounds as indicated in Eq. l4.11l 
upon making the substitutions: 



^ Bear in mind: in our cache-optimized FFT we actually materialize ©A so as to 
achieve data locality. As such it is most efficient to compute ©A all at once. 
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i' j (variable jind in the code) (4.19) 

and, 

j' i (variable iind in the code) (4.20) 

This substitution is made to simplify the notation and is allowed because: 
(1) the bounds on < / i' > (see Eq. I4.11|) coincide with those on < z j > 
(see Eq. 14. 9p and, (2) we are constructing the entire array and we need 
not keep track of the explicit relationship between the primed and unprimed 
indices indicated in Egs. 14.141 and HTSl 

Note also that the argument of datvec on the right hand side of the as- 
signment temp [index] = datvec [j ind+csize*iind] ; in Fig. 14.21 is a direct 
translation of the argument that appears in Eq. 14.161 upon carrying out the 
substitutions of Eqs. I4l9l and [4201 



void trcins_rshp(complex *datvec , complex *temp,int nmax, 
int csize) 

{ 

int iind, j ind, kind, kmax, j max , imax, index , c2size ; 

int rows,arg; 

int max C int a, int b) ; 

// The routine carries out the transpose-reshape operation 

index=0 ; 

rows = mnax/csize; 
imax = rows-1; 

c2size = int (pow(csize ,2 . 0) ) ; 
jmax = max(0,mnax/c2size-l) ; 



f or ( j ind=0 ; j ind<=jmax ; j ind++) 
-C 

f or Ciind=0 ; iind<=imax; iind++) 
{ 

temp [index] = datvec [j ind+csize*iind] ; 
index += 1 ; 

> 



f or Ciind=0 ; iind<=nmax-l ; iind++) 
-C 

datvecEiind] = temp[iind]; 

} 



Fig. 4.2. C+H- code fragment implementing the reshape-transpose in terms of 
index manipulations. 
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4.4 Reordering the Data 

After the last step of the FFT the data will not be in the correct order 
and we must do something to return it to its initial order (i.e. prior to any 
reshape-transpose operations). The simplest approach would be to rearrange 
the data by applying a series of inverse reshape-transpose operations. There 
is, however, a far more efficient approach in which no data needs 
to be moved. In other words, we use Conformal Computing techniques to 
determine the index vector which will select the correct components of the 
array. 

4.4.1 Final Transpose 

We now seek to answer the following general question: after a single reshape- 
transpose operation, where is the data? More precisely we are interested in 
the re- arrangement of the corresponding index vector. The answer to this 
question was revealed to the authors upon viewing the data vector as an 
log2 (n)-dimensional hyper- cube. The hyper-cube is formed by reshaping the 
data into an 2 x 2 x ... 2 array where the number of 2-s is given by log2(ri). An 
element of the hyper-cube is then determined by specifying an index vector of 
length log2(n) consisting of ones and zeros. There is, therefore, an isomorphism 
between the index vector of an element and the binary number corresponding 
to the index of the original one-dimensional array. 

For example, suppose n — 8, then the original data vector would be: 

A = t(8) =< 1 2 3 4 5 6 7 > . (4.21) 

Next we re-shape this vector into a hyper-cube by performing the following 
operation 

Ahyper = i3p2)p A. (4.22) 
The operation in parentheses yields 

(3p2)=<2 2 2>, (4.23) 

which is the argument to the second reshape operator. Thus more explicitly. 
Eg. 14.221 is written as: 

Ahyp,r =<2 2 2> p A, (4.24) 

which is a 2 X 2 X 2 array constructed by taking the elements of A in increasing 
order. 

Incidentally, the operation of Eq. 14.231 is an example in which a smaller 
array (in this case the scalar 2) is reshaped into the larger array < 2 2 2 >. Such 
a situation was anticipated in the definition of the reshape operator by the 
presence of the modulo operation (e.g the factor mod(7r(r x c)) in Eq. I4.10p . 
as was discussed earlier in the context of Eq. 14.101 
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The elements of the hyper-cube Ahypf.,. are now selected by specifying an 
index of zeros and ones corresponding to the binary representation of the 
original array. For example: 



<0 1 0> t/jAhyper = A[3] = 3, 
< 1 1 > IpAhyper = A[5] = 5, 



(4.25) 
(4.26) 



etc., where we have again used the bracket notation to denote selection of 
elements of one- dimensional arrays. 

Unfortunately, there is no truly satisfactory way to visualize the hyper- 
cube construction. However, for the purpose of discussion, we have adopted 
the following convention. We begin by grouping four elements together at a 
time and write them as 2 x 2 matrices surrounded by square brackets. For 
example suppose we have a 4-dimensional hyper-cube B defined by 



B = {4p 2)p t(16) 
the first four elements of B would be written as: 



<00>tpB 



1 
2 3 



(4.27) 



(4.28) 



where we have used a partial index < > to select the first four elements of 
B: < > V^, < 1 > VS, < 1 > VS, < 1 1 > and we 
have written them as a two-dimensional array (assuming row-major order), 
the next four would be grouped together as: 



followed by 



and lastly: 



<0 1> 



<10>i;B 



< 1 1 > V-B = 



4 5 
6 7 



8 9 
10 11 

12 13 
14 15 



(4.29) 



(4.30) 



(4.31) 



The entire 2x2x2x2 array B can thus be visualized by arranging the four 
2x2 blocks as a 2 X 2 array of 2 x 2 arrays and enclosing them in parentheses 
as: 

<0 0> TpB <0l> ^B 
<1 0> ijB <11> -ipB 



B = 



(4.32) 



The generalization to higher dimensions is straightforward. In the present 
example we are assuming the hyper-cube B to be of even dimension (i.e. 4). 
For an odd-dimensional hyper-cube, we treat the final index (on the left) as 
defining a column index. 
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For example, consider the array x^'') defined to be the last step in the FFT 
for the n — 32 example given in Eg. 13.81 in Chap. [31 



^ = ((log2 n)p2)px 







'0 16' 




'2 18' 








'8 24' 




'10 26' 










1 17 




3 19 








9 25 




11 27 










"4 20" 




"6 22' 








'12 28 






'14 30' 










5 21 




7 23 








13 29 




15 31 







(4.33) 



Our task in the next section is to correctly identify an index vector which 
rearranges Eq. l4.33] into the correct order (i.e. ascending integers starting with 
0) without actually moving any data. 



4.4.2 General Rule 

By studying the patterns induced by the reshape-transpose operation on arrays 
arranged as hyper-cubes (such as that in Eq. l4.33p . the authors have discovered 
a rule for returning a hyper-cube of any dimension (e.g. an input data vector 
for an FFT of any length) to its correct form in one step through the use of 
indexing, without the need to actually rearrange the data. We present and 
discuss this rule in this section. The rule is simply stated as follows. 

Given an array A, that has been rearranged with the reshape- 
transpose operator to give B, we select an element of B with a 
binary index p of the array formulated as hyper-cube. The element, 
so selected, is obtained from the array A (also formulated as a hyper- 
cube) with the index q where q is simply a cyclic permutation of 
P 

At this point, we present an example to clarify the situation. Consider a 
4x4 array of integers starting with and ending with 15 which is written as 



A=<AA> p i(16) 



12 3 
4 5 6 7 
8 9 10 11 
12 13 14 15 



(4.34) 



Now define the array B obtained through a reshape-transpose operation acting 
on A: 



B=<44> p (®A) 



4 8 12 

1 5 9 13 

2 6 10 14 

3 7 11 15 



(4.35) 



Incidentally, for the special case of a square array (as we are considering here) 
the reshape-transpose operation is equivalent to the transpose operation. 
Let us now reshape the arrays as hyper-cubes, that is: 
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Ahyper = (4 p 2) p A 2 2 2 2 > p A, (4.36) 

and, 

Bhyper = (4 p 2) p S 2 2 2 2 > p B, (4.37) 

Now if we consider the relationship between the elements of A and B 
viewed as hyper-cubes we find the following behavior: 

<i j kl> IpBhyper =<kli j> TpAhyper- (4.38) 

For example: 

< 1 > yjBhyper 1 > Mhyper = 8, (4.39) 

< 11 > ijBhyper 1 1 > Mhyper = 6, (4.40) 

< 1 1 1 > ll^Bhyper =< 1 1 1 > tl)Ahyper = 7. (4.41) 

The general rule is as follows. Suppose there are d ~ log2(ri), 2's in the 
hyper-cube representation of an r x c array (with n — rxc). Upon transforming 
A into B via the reshape-transpose operation, 

B = (<rc>) p (©A), (4.42) 

we find the following relation between the hyper-cube representations of A 
and B : 

pipBhyper = q-ipAhyper, (4.43) 

where the relationship between indices p and q will be discussed shortly. In 
Eq. |433] j4/jj,per and B^yper are defined by: 

Ahyper = (log2(n) p 2)p A, (4.44) 

and, 

Bhyper = (log2(ri) p 2)p B. (4.45) 

To complete the specification of Eg. I4l43l we need to determine the index q 
in terms of p. In general, q is a cyclic permutation of p in which log2(c) 
elements of p are sequentially removed from the left and placed at 
the right. For example, if d — log2(32) = 5, and log2(c) = 2, for 

p =<i j k I m>, (4.46) 

indexing the array Bhyper, the corresponding index of A^yper is given by: 

q =<k I mi j> . (4.47) 

Using the formalism of the V'-calculus, we rewrite Eg. 14.471 as 

q =< kl mi j >= {2 (fi <i j k I m>) = p, (4.48) 
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where we have introduced the rotate operation (p. Naturally we are also inter- 
ested in the inverse operation which is written as 

p =<i i k I m>= (-2 (f)<klmij>)=c[ (4.49) 

Now that we understand the relationship between A^yper and B^yper we 
are in a position to specify how B can be re-ordered to correspond to the or- 
dering of A/ij^per- In order to do that we invoke the definition of the generalized 
transpose operation 

Up to this point we have only used in its traditional manner for two- 
dimensional arrays. For an d-dimensional array, such as A^yper or B hyper the 
transpose operation invokes a permutation of the dimensions which is specified 
by a permutation vector as its left argument (by convention no left argument 
is required for two-dimensional arrays). Thus, because we found the following 
relationship between the components of Ahyper and Bhyper 

<i j kl m> i>Bhyper =<klmij> Ahyper, (4.50) 

we say that Bhyper is related to Ahyper through the following generalized 
transpose: 

Bhyper =< 2 3 4 1 > (^Ahyper, (4.51) 

and invoking the inverse of this transpose, we can write 

Ahyper =<3 4 1 2> (^Bhyper- (4.52) 

In anticipating further developments, we use the rotate operator to write 
the index vectors more abstractly in terms of the l operation. Specifically, 
Eqs. 14.51] and respectively become: 

Bhyper = ((2) (/) (t5)) Q} Ahyper (4.53) 

and, 

Ahyper = ((-2) (/) (t5)) Q) Bhyper (4.54) 

The right sides of Eas. l4.5l] and l4.52l can be further abstracted by substituting 
the definitions of Ahyper and Bhyper from Eqs. 14.361 and [47371 to yield 

Bhyper = ((2) (t5)) Q {5 p 2) p A (4.55) 

and, 

Ahyper = ((-2) (ib)) Q {5 p 2) p B (4.56) 

We now consider a general r x c array A having n = r x c components. 
Now define the array B to be that which is obtained from A through the 
application of Imax reshape-transpose operations: 



B = {<rc> p Q)'""-A 



(4.57) 
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we interpret the operator (<r c> p Q^max niean the operation (<r c> 
p (5) ) carried out l^ax times. Based on the principles introduced so far, the 
elements of B are related to the elements of A through the relation: 

Bhyper = ((a) </> (id)) Q (d p 2) p A, (4.58) 

where a = Imax log2(c) and d = log2(n). 

For example, the hyper-cube ^ considered previously in Eq. 14.331 cor- 
responds to Eq. 14.581 with c — 4, n = 32, Imax = 2, d = 5, cr = 4, 

Ahyper ^ {5 p 2) p A whcrc A =<8 4> p (t32) 

4.4.3 t/j-Reduction 

Building on the developments of the previous section, we now show how to 
effect the final rearrangement of the FFT. In other words, we wish to find the 
inversion of an equation of the form given in Eq. 14.581 This is easily accom- 
plished by simply changing the sign of the variable a as was demonstrated in 
Eqs. l4.53l and l4.54l a fact which nicely underscores the power of the Conformal 
Computing approach. 

The input data vector y of length n is carried into the vector x through 
the d = log2{'n) steps of the FFT. In the process, Imax reshape-transpose 
operations have been carried out. The resulting vector x is thus not in the 
correct order (as a result of the multiple reshape-transpose operations) and 
must therefore be rearranged into its final form ^. We now obtain bhyper as 

bhyper = ((-^i) 4' (^d)) G {i<d> ? 2) p x) (4.59) 

where a = Imax log2(c) and d = log2(n). The operators acting on the vector x, 
in Eq. 14.591 represent the composite inverse operation of the series of reshape- 
transpose operations that occurred during the FFT. The final step is obtained 
by reshaping S,hyper hito the final one-dimensional array: 

i=<n> p Cyper = FFT(x). (4.60) 

The derivation to normal form follows: 

Step I: Determine Shape 

We now wish to carry out the process of T/i-reduction of an expression of the 
form: 

ii-a)^{id))Qii<d> p^)p^) (4.61) 

The first step is to find the shape (p) in order to enforce the use of valid 
indices. The shape is given by: 



p{{-a) (j){Ldj)Q){{<d> p 2)p x) = {<d> p 2) (4.62) 
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which shows that permuting the elements of the shape vector does not change 
the shape of a 2-cube (i.e. a d-dimensional hyper-cube. This is true in general 
for a d-dimensional n-cube). More explicitly, the vector of Eg. 14.621 is a vector 
consisting of d entries each given by the integer 2. Such a shape implies that the 
expression in Eq. 14.611 is a d-dimensional array, the length of each dimension 
being precisely 2. In other words it is a d-dimensional hyper- cube. 

Step II: Perform Psi-Reduction and Reduce to Normal Form(s) 

For an index i to be valid it must satisfy: 

<* i <* (<d> p 2), (4.63) 

which states that all of the components of i must be less than the correspond- 
ing components of the shape vector (i.e. the index vector must consist of O's 
and I's). 

We now use a valid index vector i to select an element of the expression 
given in Eq. 14.611 By doing so we can apply the definition of the transpose 
(which is defined in terms of the corresponding rearrangement of the index 
vector). For the moment we simplify the notation with the definition: 

ri={<d> p 2) p X. (4.64) 

We thus obtain: 

i^(((-a) (id)) Qr^) = ii-a) </) i))^?? (4.65) 

which shows that selecting an element of the transposed array with an index 
i is the same thing as selecting an element from the non-transposed array rj 
using an index ((— cr) (f> i)) that has been permuted. This is, in essence, the 
definition of the generalized transpose operation. 

Now we further reduce the form of the index vector as: 

ii-a) i)) ^ (((-a) A i) ^((-a) V i))- (4.66) 

which explicitly denotes the way in which ((— cr) (j) i)) is built from two frag- 
ments of i. We have introduced the operations take A and drop V which 
select sub- vectors from i. Specifically, ((— cr) A i) forms a vector from the last 
(i.e. rightmost) a elements of i and ((— c) V i) is the vector which remains 
after dropping the last a elements of i. In the expression on the right hand 
side of Eq. 14.661 we have introduced the operation cat + which concatenates 
the two fragments together. Thus Eq. 14.651 becomes: 

i^(((-a)0(.d)) Ov) = (((-a) A i) ^((-a) y 1))^-^. (4.67) 

To make further progress, we now rely on the concept of a partial index. 
For example, suppose we have a three dimensional array Z. An individual 
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component is specified with a valid full index containing three elements such 
as: 

<klm>il>Z. (4.68) 
However, we can also extract sub-arrays of Z by using partial indices. Thus: 

<fc>VZ, (4.69) 

is a two-dimensional sub-array, while 

<kl>ipZ, (4.70) 

is a one-dimensional array. Therefore, the result of selection with a full index 
can be written as a composition, such as: 

<k I m> tjjZ =<l m> %l:{<k> ipZ) =<m> ip{<k l> i^Z). (4.71) 

Thus the right hand side of Eg. 14.671 becomes: 

{{{-a) A i) ^((-a) V i))^'?) = {{-^) V m{{-<^) A (4.72) 

The next step is to apply the Psi Correspondence Theorem^ which specifies the 
manner in which to construct indexing functions (such as the 7's introduced 
earlier) from arbitrary full or partial index vectors. 

4.4.4 The i/j Correspondence Theorem (PCT) 

We now state the tp Correspondence Theorem algebraically and then pause to 
take it apart piece by piece. 

Definition 4.1. Civen an array ^ with shape p^, Vj s.t. (rj) < (S^), and 
0<*j<* ((rj) A (pO) 



rav(M) = (ravOWJKW) A (pO)) * Ai^) V (pO) + ^(^((rj) V (pO))]- 

(4.73) 

The operators tau (r) , and delta (6) in Eq. 14.731 give the number of elements 
of a vector, and the dimensionality of an array, respectively. 

Consider first, the left hand side of Eq. [57751 The expression (jV'C) is a sub- 
array of ^ obtained using the (partial or full) index j. For example, suppose 
^ is a three dimensional array with shape =< 4 6 7 >, and j —<2>, then 
we have the two dimensional array: j^/'C 2 m n > ijj^ for valid indices 
< m < (pC)[l]: and < n < (pOP]- In this example {pOW = 6 and 
(p^)[2] = 7. For this example, we have four such arrays since (/9C)[0] = 4. 

Generally the convenient bracket notation for a one-dimcnsional vector A 
has the following equivalence: 
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A[r] =<r> i)A. (4.74) 

for some scalar < r < (pA)[0]. Often we use standard vector notation 
{e.g. A) for emphasis. We stress, however, that the consistent structure of 
the algebraic system is designed in such a way that scalars and vectors are 
simply zero- dimensional and one- dimensional arrays and as such don't require 
any special symbols. For convenience, however we often use such symbols as r 
to indicate a scalar and O =<> to denote the empty vector. The existence of 
the empty vector is necessary in order that the scalar r has the correct shape 
p(r) —<>. The existence of the empty vector may seem strange to some 
readers. Its use, however, is necessary in order to have a consistent algebra. 
The need for the empty vector is analogous to the need for the zero in the 
set of integers and the empty set in set theory. 

Note a related aspect of the theory is often misunderstood by newcomers: a 
vector < r > is NOT equivalent to a 1 x r array (i.e. a row-vector in traditional 
matrix theory) NOR is it equivalent to a r x 1 array (i.e. a column-vector in 
traditional matrix theory). In the present theory, these three objects each have 
different shapes, namely, <r>, <1 r> and <r 1> respectively. In traditional 
matrix theory, the distinction plays no essential role. 

Continuing our analysis of Eq. I4.73i on the left hand side, the operation 
ravel ( rav ) takes its argument, the sub-array jV'C flattens it into the one- 
dimensional array rav (jip^,). On the right hand side of Eq. [47731 the expression 
is written in the form ( rav^)[r -t- a ] where r is a scalar offset which is added 
component-wise to the vector a to produce the vector index b = r -|- a. The 
components of the index vector b are integers which select the components of 
^ in lexical order. Explicitly the scalar r and vector a in Eq. 14.731 are given 
by: 

r = 7(j; ((rj) A (pO)) * (^(W) V (pO)) (4-75) 

and, 

a = .(^((rj) V ipm (4.76) 

respectively. These expressions will be described in detail shortly. First we 
continue the example of a three dimensional array introduced above. 
In the example considered above, 

rav(j^^) = rav(<2 m n> ^^), (4.77) 

is a one dimensional array consisting of the following elements: < 2 > i/'C , 
<2 l>ipC, <2 1 0> <2 1 1> VC, <2 2 0> V^, <2 2 1> V'f, etc. 

Now the significance of r reveals itself. It is the index of the first 
element of our sub-array jtp(_. Likewise the vector a is a vector of 
integers starting with which serves as a vector of regularly spaced 
offsets. The total number of integers in the vector a is equal to the number 
of elements in the sub-array jipC- 

The shape vector naturally partitions into two sub-vectors given by 
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((rj) A {pO), (4.78) 

and, 

(W) V (pO)- (4.79) 

In the first expression rj counts the number of elements of the partial index 
j and the take operation ( A ) forms a vector of length equal to that of the 
index vector composed of the first rj elements of the shape vector (i.e. the 
rj leftmost elements of p^). The second expression forms the corresponding 
vector of the remaining elements. The product of the elements of 

(W) V (pO) (4.80) 
gives the total number of elements of the sub-array j?/;^ and is written as: 

n = ^(((rj) V (pO)) (4.81) 

Now we complete the description of the vector of offsets a by using the 
iota operation t to create the vector of integers a = i(n) as given in Eq. 14.761 

Likewise, the starting index r has a simple explanation. In Eq. I4.75[ the 
expression 

7(j; ((rj) A (pO)) (4.82) 

counts the number of sub-arrays that precede the one of interest ji/>^ in the 
array ^. Thus, in order to obtain the index of the first element of our chosen 
sub-array jijj^ we multiply by the total number of such elements, given in 
Eq. 14:811 to give Eq. [4J5l 



4.4.5 Applying the ip Correspondence Theorem 

We now apply the PCT, developed in the previous section, to the expression 
given in Eq. 14.721 We begin by writing the right hand side of Eq. 14.721 in the 
form required by the theorem: 

ii-a) V i)^(((-fT) A i)7A77) =j7AC (4.83) 

where the partial index j of the PCT has been defined to be 

j = ii-a) V i), (4.84) 

and the array ^ of the PCT has been taken to be the subarray: 

e = ii-a) A i)^7^. (4.85) 

We now work out the various quantities appearing in the PCT. The index 
i is chosen to be a full index for the array 77. From Eqs. 14.621 14.631 and 14.641 
we find the shape of the index i to be: 



p{i) =<d> 



(4.86) 
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From Eg. 14.841 we find the shape of j to be: 

p{i)=<d-a> (4.87) 

which follows from the definition of drop y as applied to Eq. 14.841 Formally 
we say: 

T(j) =d-a, (4.88) 

where the tau operator counts the number of elements in the vector j. Next 
from Eq. 14.851 we find the shape of ^ to be 

PiO = p{ii-<^) A i)V^77) =<d-a> p 2, (4.89) 

which shows explicitly that the index vector j is a valid full index for the 
sub-array ^ as required. 

Next we compute the following quantity appearing in the PCT: 

rii) V (pO = {d-<7)s/ {<d~a> p 2) =<>= 0. (4.90) 

We obtain the empty vector because the drop operation y is dropping 
d — a elements from a vector that contains precisely d — a elements. In the 
next step, the product operator tt acts on this quantity to give: 

Hr{i) V (pO) - HO) = 1. (4.91) 

In general, the product operator tt multiplies the elements of a vector to obtain 
an scalar quantity. The equivalence on the right hand side of Eq. 14.911 defines 
the product of the empty vector to be unity. 

We have now computed all the quantities needed to evaluate the scalar 
index r of Eq. 14.751 With the above quantities, the scalar index appearing in 
the PCT is now given by: 

r = 7(((-f^) \7 i);<d-a> p 2) (4.92) 

Now we need to compute the offset vector a appearing in Eq. 14.761 In order 
to do that we form the quantity: 

^{Hr{j) V {Pm - ^^0)) - .(1) -<0> . (4.93) 

We see that the offset vector a contains only one entry < >. Thus the 
combination r + a selects only one element. This is because although j is a 
partial index of i, it is indexing the subarray ^. Therefore, the index j is a full 
index of the subarray ^. This is consistent with the fact that, although we are 
dealing with a partial index j and a subarray ^ it is one step in the overall 
calculation of the action of i on the array ry, the action of which is to select a 
single element. 

We now summarize the calculation so far. We now have: 
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ravOV-O = rav(((-a) A i)Vr/)[(7(((-'T) y i);< rf-^> ? 2)+ <0>]. 

(4.94) 

Now we can simplify the scalar index r of Eq. 14.921 by noting that as 
the index ((— cr) y i) cycles through all possible values (in order), the scalar 
r takes on all values from to (7r((< d ~ a >) 'p 2) — 1). We thus define 
a new variable t, and the right hand side of Eq. 14.941 {RHS) simplifies as: 
yt, s.t. 0<t< (7r((< d-a>)p 2)) 



RHS= rav(((-cr) A i)tpr])[t+ <0>] = rav(((-CT) A i)V'?7)[<t>] (4.95) 

Now, we must apply the PCX again, this time to the quantity 

((-a) A i)^Pr], (4.96) 

that appears in Eq. 14.951 In order to facilitate the application of the PCT, we 
redefine the variables j and ^ appearing in the PCT as follows. We write: 

j = ((-a) A i), (4.97) 

and, 

C = ry = (<rf> p 2)p X, (4.98) 

where we have used the definition of rj given in Eq. 14.641 

We now proceed to evaluate, step by step, the quantities appearing in the 
PCT based on the new definitions of j and ^ given in Eqs. 14.971 and [T951 The 
shape of the new index j is given by: 

p(j)=<a>, (4.99) 

and the total number of components of j is 

r(j) = a. (4.100) 

The shape of f is given by 

p^ ^<d> p 2. (4.101) 

Using these results we have 

T(j) A {pO=<^> P 2, (4.102) 

followed by 

T(j) V {pO^<d-a> p 2, (4.103) 

and 

<r{i) V {pO)^<<d-<y> ?2) = 2^^^ (4.104) 

We now have computed all the quantities needed to evaluate the scalar r of 
Eq. 14.751 Explicitly we have 
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r EE 7(((-ct) a i); (<cr> p 2)) * 2'^-'' (4.105) 

We can also simplify this expression by noting that as we cycle through all 
possible values of the partial index vector {—a) A i, the function 7(((— cr) A 
i);(<(7> p 2)) takes on all values from to (7r(<cr> p 2)-l) = 2'^-l. Thus 
we define a new variable s and we note that Vs, s. t. < s < (7r(<i7> p 2)) 
the scalar variable r is written as: 

r = s*2''"'^ (4T06) 

Lastly, to complete the application of the PCT to the expression of Ea. [4?96l 
we compute the offset vector: 

V (pO))) = ^(2''"'^)=<0 1 ...(2''-'^-1)>. (4.107) 

Now we summarize this second application of the PCT. By acting on the 
expression in Eg. 14.961 with the ravel rav operator and applying the PCT we 
obtain: 

rav(((-cr) A i)i/'??) x[s * 2'*-'"+ <0 1 ...{2'^-'' - !)>]. (4.108) 

We see that for each value of s, the result of Eq. 14.1081 is a vector of length 
2'^~" . Returning to Eq. 14.951 however, we find that to obtain the final result, 
we must select a component of Eq. 14.1051 using the index <t>. Thus the final 
expression in Eq. 14.951 using Eq. 14.1081 is written 

rav(((-(T) A i)7/)77)[t] ==x[s*2'*-'"+ <0 1 . . . (2''-'" - 1) >][<t >], (4.109) 

which is simply equivalent to the expression: 

rav(((-cr) A i)V'?])M x[s * 2''"'"+ <i>], (4.110) 

which is the final result. 

4.4.6 Summary of Final- Transpose Operation 

The ONF is now expressed as follows. Define two new indices t and s with 
limits given by: 

< t < t* = {■K{{<d- (T>) p 2)), (4.111) 

and, 

< s < s* = {n{<a> p 2)), (4.112) 
we define a new two-dimensional array by reshaping the vector ^ as: 

^(2) =<s* t*> pi, (4.113) 

where ^ is defined in Eq. 14.601 The final result is then written: 

<st> i^^^^^ ^k[s*2'^''' +t]. (4.114) 

This result was translated directly into code as shown in Fig |5.4l 
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void final_trans (complex *datvec , complex *temp,int nmax, 
int lcap,int csize) 

{ 

/* dvar = # of 2's in the hypercube 

* leap = # of transpose-reshapes (T-rho) that have to be undone 

*/ 

lege = logCfloat (csize) ) /log(2 . 0) ; 

sig = lcap*logc; 

smax = pow(f loat (2) , sig) ; 

dvar = log(f loat (nmax) ) /log(2 . 0) ; 
tmax = pow(2.0,dvar-sig) ; 

index = 0; 

f or (tind=0 ; tind<=tmax-l ; tind++) 
{ 

f or (sind=0 ; sind<=smax-l ; sind++) 
{ 

temp [index] = datvec [sind*tmax + tind] ; 
index += 1 ; 

} 

} 

f or (index=0 ; index<=nmax-l ; index++) 
-C 

datvec [index] = temp [index]; 

} 

} 

Fig. 4.3. C++ code fragment implementing the final transpose bringing the data 
back into the correct order. 

4.5 Conclusions 

We have presented a tutorial introduction to the techniques of Conformal 
Computing illustrated in the context of the new cache-optimized FFT al- 
gorithm presented in the preceding chapter (Part I: see Chap. [3]). Two key 
aspects of the new algorithm, the reshape-transpose operation and the final- 
transpose operation were presented and discussed in detail. These two exam- 
ples are excellent vehicles for developing and illustrating the new techniques 
in that many of the most important concepts of the theory (such as the notion 
of shapes, re-shaping, indexing, the ^ function, the ip- correspondence theorem, 
etc.) are presented. Indeed, the reshape-transpose operation is an extremely 
important concept which is applicable in many other situations. In addition, 
the ip- correspondence theorem is a cornerstone of the Conformal Computing 
approach. It is the key link between the Mathematics of Arrays (MOA) , which 
provides a means for reasoning about the algebraic properties of array-based 
algorithms, and the '0-calculus which allows one to reduce an algebraic expres- 
sion to an explicit form that can be directly translated into computer code in 
any computer language. 

The Conformal Computing approach is leading to important new insights 
by allowing one to view multi-dimensional arrays, their decompositions and 
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mappings in a unified, general way. In other words, one can change the dimen- 
sionality of a given array, through the use of the reshape operation to suit the 
needs of the application at hand (without necessarily moving the data). In 
particular, the multi-dimensional hyper-cube played an important role in the 
development of the insights leading to the final-transpose operation appearing 
in the new FFT. In another ongoing investigation, the use of the hyper-cube 
representation is shedding new light on problems related to quantum comput- 
ing: an area in which the hyper- cube is a natural data structure in a context 
based on two-state qubits (see Chap. [51). 
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A Cache-Optimized Fast Fourier Transform: 
Part III 



5.1 Chapter Summary 

This chapter continues to develop the techniques of Conformal Computing 
as apphed to the Fast Fourier Transform (FFT) that were introduced in the 
two preceding chapters. In these previous chapters, a new cache-optimized 
algorithm was presented that was two to four times faster than our previous 
records (which beat or were competitive with well-tuned library routines). 
This chapter presents a new hyper-cube representation that is contrasted with 
that of the preceding chapters. We argue that any arbitrary partitioning of 
the data over cache, processors, etc. can be efficiently handled in a hyper-cube 
representation. The rearrangements of the data (virtual or physically-realized) 
are represented in the hyper-cube in terms of direct indexing, thus avoiding 
most temporary arrays. Implementation and performance details, presented in 
the two preceding chapters are also reviewed. In addition to the presentation 
of this new hyper-cube view, this chapter also serves as a continuing tutorial 
introduction to the methods of Conformal Computing. 

5.2 Introduction 

The focus of this chapter is the formulation of the FFT in a generalized 
hyper-cube representation using the high-level techniques of the Conformal 
Computing approach. 

The Fast Fourier Transform (FFT) is one of the most important compu- 
tational algorithms and its use is pervasive in science and engineering. The 
work in this chapter builds on that of the two previous chapters in which 
the FFT was optimized in terms of in-cache operations leading to factors of 
two to four speedup in comparison with our previous records. Further back- 
ground material including comparisons with library routines can be found in 
Refs. [laiTHlliniEO] and [16] and in Chap.ffl 
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It is also important to note the importance of running repro- 
ducible and deterministic experiments. Such experiments are only 
possible when dedicated resources exist AND no interrupts or 
randomness affects memory/cache/communications behavior. This 
means that multiprocessing and time sharing must be turned off for 
both OS's and Networks. 

The MoA is a consistent mathematical system in which operators act on 
arrays to carry out arbitrary rearrangements of the array elements. Arrays 
can be repartitioned (reshaped) in arbitrary ways. The MoA bears similarities 
to Linear Algebra and Group Theory but was designed specifically to allow 
reasoning about the mathematical problem to be solved (i.e. the application) 
and layout of the underlying hardware using a common formalism. By using 
the MoA one obtains high-level monolithic array expressions. 

The second cornerstone of the Conformal Computing approach is the ip- 
calculus. Each of the various operators in the MoA is defined in terms of its 
action on the indices of the array on which it operates, as defined by the array 
shape. The V'-calculus allows one to translate the high-level MoA expressions 
into the so-called Denotational Normal Form (DNF), an expression involving 
only cartesian indices of the array, i.e. the semantics. In this way temporary 
arrays are virtually eliminated. Also, due to the mathematical properties of 
the V'-calculus (i.e. the Church- Rosser property [70]) two expressions may be 
proven to be equivalent by demonstrating that they reduce to the same normal 
form. This ability will become increasingly important as issues of performance 
necessarily extend to power consumption, heat generation, etc. 

To take the DNF into a form (i.e. the Operational Normal Form) (ONF) 
that can be directly translated into efficient computer code in any hard- 
ware/software language, one then employs the so-called Psi Correspondence 
Theorem (PCT) as discussed in the Chap. [H The resulting ONF can be 
directly translated into efficient computer code because the ONF explicitly 
shows how data should be manipulated. Loops are revealed in terms of stops, 
starts, and strides. 

The techniques of Conformal Computing have a long history dating back 
to the work of Sylvester in the nineteenth century. The Universal Algebra of 
Sylvester, introduced in 1894 [M], and reintroduced by Iverson [71], formed 
the basis for the programming language APL and subsequent machine de- 
sign [TS]. Abrams' [THj revolutionary insight into the use of shapes to define 
array operations was the inspiration for the V'-calculus. Unfortunately, APL 
had too many mathematical anomalies [72] to be used as a formal mathe- 
matical tool. In addition, APL never had an associated indexing calculus like 
the ip-calculus. Similarly, closure was not obtained for Abrams' indexing rules 
despite 10-h years of research [75 ] [7^ 1 [75 1 [75 ] [77] . 

MuUin's introduction of MoA and V'-calculus removed all anomalies in Iver- 
son's algebra and put closure on Abrams' indexing through the introduction 
of the indexing function, She also combined MoA and V'-calculus with the 
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A-calculus |78) to achieve full reasoning capabilities computationally building 
upon recommendations from Perils [TS] , Berkling [HU] , and Budd [5T] . 

The most difficult aspect of the Conformal Computing approach is the 
need for one to learn to think in the space of multi-dimensional arrays and 
to envision an algorithm in which the architecture and network are 
viewed as one data structure. This is the part of the approach (that is still 
somewhat of an art-form) that leads to the high level formulation of the prob- 
lem in monolithic MoA constructs The techniques of the ^-calculus are more 
straightforward and can be applied mechanically since all transformations are 
linear. 

This chapter continues to introduce higher-level concepts of the theory as 
required for the application at hand: the hyper-cube formulation of the FFT. A 
tutorial style is adopted as in previous chapters as the concepts are, no-doubt, 
unfamiliar to most readers. Thus we present all steps of every calculation. As 
such, there is considerable mathematical detail which may appear formidable 
at first glance. The determined reader, however, will no-doubt be rewarded 
by going through each step in detail. All fundamentals of the theory needed 
to approach this material were presented in Chap. [21 Again we emphasize 
that Conformal Computing is not a programming language but rather is 
an algebraic approach to an efficient construction of computer programs for 
implementation in any programming language. 

5.3 Extreme Generality of Representation: Contrasting 
MoA with Linear Algebra 

5.3.1 Linear Arrays and Hyper-Cubes 

The Mathematics of Arrays is extremely general in its ability to represent 
multi-dimensional arrays. Conceptually any array, independent of its dimen- 
sionality can be thought of as a one-dimensional array simply by forming the 
vector containing all of the array's elements in some pre-chosen order. Since 
this process is done so frequently in MoA we define it as the operation Ravel. 
Thus: we define the operator rav to be a unary operator that produces a 
vector V of the elements of the multi-dimensional array A as: 

V = ra.v A (5.1) 

The vector v so produced is a linear array. As such it is the representation 
of the data with the least number of dimensions but the greatest number of 
elements in a given dimension. 

In the opposite extreme, we can equally envision an array with the most 
number of dimensions. The hyper- cube is just such an array. Each dimension 
has only two allowed values and 1 and the number of dimensions is equal to 
log2 {N) where N is the total number of elements in the array. The ordering of 
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hyper-cube is chosen to correspond to row-major ordering so the array index 
of an element read from left to right corresponds to the binary representation 
of the number of element in the array (assuming zero offset arrays, as is the 
case for the C/C++ language). 

In between these two extremes, one can imagine a host of multi-dimensional 
arrays. The various ways in which an array of length TV can be partitioned 
is by given by the ways in which the number N can be factored: each factor 
corresponding to the length of a given dimension and the power of the given 
factor corresponding to the number of dimensions having the given (factor) 
length. 

5.3.2 The Importance of the Array's Shape 

The power of the Conformal Computing approach lies in its abil- 
ity to view the array using any multi- dimensional representation 
that is convenient. For example, we often (1) pose the problem as a multi- 
dimensional array based on the structure of the underlying science or engi- 
neering problem, (2) increase the dimensionality of the problem to represent 
a given partitioning of the data (block, cyclic, etc.) over the hierarchy of the 
machine (cache, memory, paged memory, disk, network, processor, machine, 
grid-network, etc.). (3) It often is convenient to change the dimensionality 
of the problem again by viewing it as a hyper-cube. At this stage all of the 
formal operations (matrix transformations, etc.) are carried out. Lastly we 
change the dimensionality further by (4) projecting down to a linear array 
for the final implementation on a computer. The final form based on a linear 
array (i.e. the Operational Normal Form: ONF) is extremely efficient in that 
direct indexing of contiguous memory addresses is used. 

The underlying view of MoA is that an array is specified by two vectors: (1) 
the shape vector, and (2) the one dimensional array of the array's elements, 
i.e. layout. For a given array A the shape vector is a vector of length equal to 
the number of dimensions. Each element of the shape gives the total number 
of elements in a given dimension. As previously discussed (in Sec. I2.1.T|) . the 
second vector is simply the Ravel of the array. 

Just as we introduced an operator rav to obtain the Ravel of the array 
(written rav A) we introduce the shape operator p to obtain the shape pA. 
Thus the shape vector s is obtained from the array A by the assignment 
s = pA. 

The ability to change the dimensionality of an array is provided by the 
reshape operator written as p. The reshape operator p is a binary operator 
that takes an array (the array to be reshaped) as the right argument and 
a shape vector (i.e. the new shape) as the left argument. This operation 



^ In general the choice of ordering is arbitrary (e.g. row-major, column-major, etc.) 
and is conveniently specified by the definition of the corresponding gamma func- 
tion. 
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creates the new array by fiUing the elements of the new array sequentially 
from the Ravel of the old array. Note that the shape gives us new ways to 
access the structure not the layout, which is still row or column major. That 
is, access changes without actually creating a new array. The reshape-operator 
is completely general in that it can take as its left argument any arbitrary 
shape and is not constrained by shapes corresponding to the same number 
of elements of the old array. If the total number of elements of the new array 
is larger than that of the old array, one simply starts at the beginning of the 
Ravel once one runs out of elements. For reshaping into a smaller array one 
simply takes as many elements from the Ravel as will fit into the new array. 

5.3.3 MoA Operator Constructs and the i/?-Calculus 

The extreme flexibility to work in multi-dimensional arrays afforded by MoA 
results from the use of an advanced algebraic system that is similar to standard 
Linear Algebra but is considerably generalized. The structure of this algebra 
is summarized in Chap. O Here we recall some of the most important notions 
of the theory and contrast it with similar (but limited) constructs in standard 
Linear Algebra. 

So far we have introduced the operators Ravel rav , shape p, and reshape 
p. Another very important operator for our present purposes is the transpose 
operator 0. The transpose operator is a binary operator that takes an array 
as its right argument and a permutation vector as its left argument and has 
the action of permuting the order of the elements of the array by permuting 
the dimensions H All of the operators in the theory are defined in terms of 
the effect on the indices of the array with shapes. The connection between an 
array's index and its elements is given by the psi operator ■(/;. 

We now demonstrate the use of the operator formalism to indicate the 
data restructuring discussed in the first paragraph of section fS. 3. 21 In step (1) 
we pose the problem in terms of monolithic multi-dimensional arrays A, B, 
etc. These we specify by their shapes and Ravel's: 



(2) Next we reshape the arrays to correspond to a given decomposition with 



(3) Next to carry out the operations of the theory (linear transformations) we 
transform to the hyper-cube representation. 



va = rav A; — pA. 



(5.2) 



A' = s 'pA. 



(5.3) 



A'h=<2 2 2 ■■■ 2> pA', 



(5.4) 



Note that the MoA/i/)-calculus definition of ® is part of the Fortran 95 standard 
and was introduced therein by Mullin. 
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(A series of numbers between angle brackets is used to denote a vector.) 
(4) After carrying out all of the transformations required to reach a final form, 
that we denote by Bh we transform back to a linear array: 

vb = ravSij. (5.5) 



5.4 FFT in the Hyper-Cube 

Suppose the input vector was 

q =< 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > . (5.6) 

We put the index position into the contents of q so that we can see how 
the indices move around during transformations. Consequently, with an input 
length of 16, there are I = log2lQ = 4 cycles, labeled by j, to the FFT: 

Vj; 0<j<l. (5.7) 

So, prior to the initial step, we restructure q: 

Z=i<l> p 2) p q, (5.8) 
(where (< ^ > p 2) is a vector consisting of 'T' 2's; in this case I = 4). 



Step j=0 

The initial hyper-cube is given by: 
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(5.9) 



We now state the hyper-cube FFT in Conformal Computing notation and 
illustrate its use. In the following sections we present exhaustive detail as to 

how it works by supplying the reader will all of the steps of the derivation. 
At each j step wc want to update all transposed pairs: 

Zj = (< >^ J7<i i>Zj)+J7<o i>(< 1 — 1 >x i^<i o>((< 1 >)j/"*^<i i>Zj))-, 

(5.10) 

where, 

Z,^t,Q)Z. (5.11) 

with, 



t, ^ {{{I - 1) - j) A c ) ^((-1) A c ) a {{{I - 1) - j) V c )). (5.12) 
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The vector t ^ permutes the dimensions of the hyper-cube so that at each step, 
neighboring elements are the ones that need to be combined for the FFT. In 
Eq. 15.121 the vector c is a zero-offset vector of integers of length equal to the 
number of dimensions of the hyper-cube: 



c = 



(5.13) 



The vector t^- was found by observing the patterns that arise in the FFT. 
Examples of the hyper-cube permutations are given in the following three 
steps. The structure of tj will be explored in some detail shortly. 

We know there are 4 cycles to the FFT since I = ^052 16 = 4. Recall, 
< j < ^. In step 0, we want to index all pairs and simultaneously update all 
components. That is, all pairs are updated by expressions in Eqs. 15.101 15.111 
and [5321 

In step 1, we must permute the hyper-cube s.t. we can access the following 
permuted indices: 



(5.14) 



In step 2: 



In Step 3: 
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(5.15) 



(5.16) 



5.5 Matrices, Arrays, Hyper-Cubes 

q is the input vector, we define: 

Z={<1> p 2) p q, 

and V < j < I, let: 



(5.17) 
(5.18) 



Zj — (<O>0 J7<i i>Zj)+J7<o i>(<l — l>x ■f^<i o>(< 1 > 4)i^<i i>Zj)), 

(5.19) 
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where, 

Zj = tj Q Z, (5.20) 

with, 

t, ^ {{{I - 1) - j) A c ) ^(-1 A c ) Mj a (((/ - 1) - j) V c ))(5.21) 

To summarize, Z is the initial hyper-cube formed by the elements of the 
input vector q and Zj are the various permuted hyper-cubes. The array c 
is a vector of integers that gets permuted to give the permutations of the 
hyper-cube. 

In this chapter, we don't attempt to prove Eq. 15.191 algebraically. It sum- 
marizes the result of our experience with the FFT as illustrated in our two 
previous chapters. Rather, we proceed with Eq. l5.19l as written and apply the 
machinery of Conformal Computing to derive the ONF. 

It is helpful to consider the structure of tj in detail. Therefore we now 
illustrate the construction of tj for a complete example in Fig. 15.11 

tj : suppose Z = 8^c=< 01234567> 
< j < ^ ^ j = 0,1,2,3,4,5,6,7 
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Fig. 5.1. Illustration of the construction of the index tj. 



5.6 Derivation 

We first reduce Zj then tj QjZ. The two derivations are then merged. 

We begin with Eq. 15.191 in which Zj is entirely updated by taking the 
0th component of all 2 component vectors (pairs) and adding < 1 — 1 > x 
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the 1st component of all 2 component vector^. The easiest way to perform 
ip reductions, is to reduce an expression's constituent pieces separately. Thus, 
we'll rewrite Eg. 15.191 as follows. 
Let: 







A = 


(< >^ r2<i i>Zj), 


B EE 


(Cx^2<io>^), 


C = 


< 1 - 1 >, 


D = 


(< 1 >^ i7<i 







(5.22) 

The quantity Z is the restructured (i.e. hyper-cube) input vector. 
5.6.1 Reduction of D 

Now, by applying the definition of Q (see Sec. I2.1.4p we demonstrate the 
reduction of D. We have: 

£» =< 1 >^ r2<i i>Zj. (5.23) 

That is, take the 1st component of each vector in Zj. If we look at the entire 
expression above we'll see that A takes the Oih component of each vector in Zj . 
Thus, except for a sign difference, the derivation is the same. Consequently, 
the derivation for A will be omitted. Applying the definition of Omega (see 
Sec. [2T4|l : 

il =< 1 >, gf^d =i, fi<l 1>, ir ^ Zj, with Zj = tj (5 Z, 

SS^l^l, d=<(Ji(J.r>; SCr^l- (5.24) 

That is ^; is < 1 >, a vector, and is 1-dimensional. d is used for partitioning 
information. Here ct; is 1, so we know we want vectors from the left argument, 
^i, and vectors from ^r, since 0"^ is 1. Continuing we have: 

p^i =<l>,g = p^r^{<l> P 2) 

ai = 1, (Jr = 1. 

We now present all steps of the reduction of D by simply applying the 
definition of the f2 operator as follows: 

m = 0, 
X = < >, 



^ We assume that weights have been applied to q. 
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u = V ((-1)V <1>) = 0, 

V = V ((-1) V P^r) = {<l-l> P 2), 

y = (-1) A p^i =< 1 >, 

z = (-1) A p^r =< 2 >, 
p^i = u -ff X 4fy =<> + <> + <! >=< 1 >, 
p^r = V +hx +hz = V -ff z =< V z >= (< ; > p 2). (5.25) 

Thus, 

i = <>, 
0<*j <*v, 
k = <>, 

W = p{{<> V6)V'(jV'<er)), 
= P{<1> VOV'^r)), 

= (r < 1 >) V ((rj) V (pCr))), 
= lV(<'^-l>V(<«>P 2)), 
= IV < 2 >, 

w = <> . (5.26) 

Note, in the above, we made use of the identity < > = ^ that holds 
(by definition) for any array ^. 

5.6.2 Denotational Normal Form for D 

Prom above, the DNF for D is: 

D = {<1>^ i7<i i>Zj). 
Thus, the shape of D is defined by: 

Pi^lg^d^r) = (U-H-V-H-X-H-W), 

pD = {<l-l> p 2). 

Components are extracted using the index v with <* v <* (< / 
as foUows: 

Vj; 0<*j<*v, 

j^D = jV(^K5^d)Cr), 

= (<> ^ClMi^^r), 

= < 1 > VXj^^j), 
= (j*l)V'^,- (5.29) 



(5.27) 

(5.28) 

-l> P 2), 



The final result above is a scalar. 
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Note the simple heuristic way of thinking of D. We can see that the array D 
is simply the collection of all elements of Zj whose index vector has the value 
1 in its right-most bit. For example, if Zj has shape pZj =< 2 2 2 2 >, then 
the array D contains the following elements: < 1 > ipZj, < 1 1 > ipZj, 

< 1 1 > ^pZj, < 1 1 1 > TpZj, < 1 1 > < 1 1 1 > ^pZj, 

< 1 1 1 > ipZj, and < 1 1 1 1 > ipZj. In like manner, array A has a in its 
rightmost bit. 

This means that for step j = we are working with the initial hyper-cube 
of Eq. 15.91 the ravel of D is given by: 

ravL* =<1 3 9 11 5 7 13 15>, (5.30) 

while the ravel of A is given by: 

ravA=<0 2 8 10 4 6 12 14> . (5.31) 

5.6.3 Reduction of B 

Substituting the derivation for D we have: 

Vj; 0<* i<* {<l-l> P 2), (5.32) 

B = Cx ^2<i o>D 

= < 1 - 1 > x^2<i o>(< 1 >v> ^<i i>Zj), (5.33) 

jV-B = < 1 - 1 > xr?<i o>(< 1 > Hii'Zj)), 

= < 1 - 1 >x r2<i o>((j *l)^^j). (5.34) 
Now we apply the definitions associated with f2: 

p^l =< 2 >, d =< 1 >, pS,r =< I - I > p 2, 

<Ti ~ I m — ar — 0. (5.35) 

Continuing: 



X = 


A (-1) V < 2 > 


=<>, 


u = 


V (-l)V < 2 > 


=<>, 


V = 


V V PCr 


= {<l-l> p 2 


y = 


(-1) A < 2 > 


=<2>, 


z = 


A p£,r 


=<> . 



(5.36) 
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0<*i<*u=<>, i=<>, 
<* j <* v = (< Z-1 > p 2), 

<* k <* x=<>,^k=<> . (5.37) 

W = ^k)VX6(9^der))), 

w = p{<> V; < 1 - 1 >) X {ii^D), = p{<l - 1 > x(j ^l)i>Zj)) 

w = < 2 > . (5.38) 

5.6.4 Denotational Normal Form for B 

Prom the above we conclude: 

B = {C^Q^ioyD). (5.39) 

The shape of this expression is: 

pB = \i 4f V +hx +Fw, 

= < / > p 2. (5.40) 

Components of this expression are extracted with the vector j such that: 
Vj, 0<*j <*(<;-!> p 2), 

(i 4+k)V'S = ((i ^k)V'Ci) X ((j ^k)^er), 
jVB = (<> VC) X (j^D), 

jVB = < 1 - 1 > x(j ^ < 1 >)i,Zj. (5.41) 

This last result is a two-component vector < d {—d) > and for each value 
of j the variable d takes on the components of the ravel of D (i.e. d = 
{1, 3, 9, 11, 5, 7, 13, 15}). 

5.6.5 Reduction of A 

We don't need to do this derivation since it is nearly identical to the derivation 
for D. Thus it will be omitted. Therefore, 

A=<Q>^ n^iiyZj, (5.42) 

and, Vj 0<*j <* (<Z-1 > p 2), 

ii^Zj = ((j ^0)^Zj)+f2^o i> < 1 - 1 > x((j -H-l)ijZj). (5.43) 
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5.6.6 Denotational Normal Form for A 

The DNF for A has the foUowing shape: 

pA={<l-l> p 2). (5.44) 
The components of A are obtained as: 

jM = (< > = (j * < >)V^,. (5.45) 

Note the similar result to that obtained for D as expressed in Eq. 15.311 

5.6.7 Final Reduction 

Consequently, Vj; <* j <* (< ^ - 1 > p 2), 

jiljZj =jiP{A+f2<Qi>B), 

= ((j * < >)ijZj))+[2<o !>(< 1 - 1 > x(j ^ < 1 >)t^Z,l5A6) 

Now, again we apply the definition of i7: 

= A, gf2d =+^<oi>, S,r = B, 

S^l = I - I, d =< (71 (7r >, 5£_r = I, 
p^l = {<l-l> p2), p^r = {<l> P2), 

ai = 0, Or^ 1. (5.47) 

Thus, 

m = Z — 1, 

X ==-(/- 1) A V (<^ - 1> ? 2) = (< ; - 1 > p 2), 

u = -(;-i) V V (<^-i> p 2) >, 

V = -(/ - 1) V 1 V (<^> P 2) >, 
y = A (</ - 1> p 2) =<>, 
z = (-1) A (</> p 2) =< 2 > . 

(5.48) 

Continuing, 

<* i <*<>, 
<* j' <*<>, 
<* k <* (< ; - 1 > p 2), 



w = p((j -H- < >)^Z,) + (< 1 - 1 > x((j ^ < 1 >)7/.Z,)), 

w = < 2 > . (5.49) 
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This expression results from first taking components and then finding the 
shape of the result. The components are obtained in the following (see 

Note: Zj is raveled (flattened) after the transpose, i.e. 

z, = ravtj (5.50) 

5.6.8 Final Denotational Normal Form 

We now find the final DNF. The shape is given by: 

p {A+Q<o i>B) = u 4f V 4f X -Vrv^ =< I > p 2. (5.51) 
Components are obtained as: 

(i ^j' ^k)V(A+r2<o i>B) = ((i + ((j' ^k)VB), 

ki/;(A+r2<o i>B) = (kM) + (kV-B), 
jM+^<o i>B = jtpA + {jtPB), 

= < > i^{i^jZ/}+ < 1 - 1 > x(< 1 > ij{j^z,)), 

= ((j ^0)^^,)+ < 1 - 1 > x((j -i^l)i^Z,). 

(5.52) 

Note that both k and j have the same limits: <* k <* (< ^ — 1 > p 2), and 
0<*j<*(<^ — 1> p2), respectively. This is why we have changed from 
k to j in the third line above. That is why we say j 4f or j -H-l to index a 
scalar from Zj. 

In simple terms, the right hand side of Eg. 15.521 can be written as: 

a+ <1 - 1> xd = a+ <d {-d)>^<{a + d) (a - d)>, (5.53) 

which says, take each element of A (denoted by a) and add and subtract each 
corresponding element of D (denoted by d). This way of combining correspond- 
ing elements (assuming the FFT weights have already been included 
in D) is called the butterfly operation (sec Sees. I3.4.2[ 13.4.71 and Eqs. 13.171 
and I3.18P and the n component vector obtained by joining all of the vectors 
<{a-\-d) (a — d)> gives the vector that results (i.e. Zj) after applying a single 
iteration of the FFT. 



5.7 Thinking in Hyper-Cubes 

What follows is a description of how each cache piece in the FFT can be 
viewed as a hyper-cube with dimension log2 n with n the cache length. Unlike 
the discussion in Chaps. [3] and[4] which reahzed each cache piece (i.e. physically 
moved data around) to bring locality to the cache, we access the indices that 
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are within the cache directly without an intermediate realization between each 
transpose. Recall that the access patterns for the FFT predicted cache misses. 
Consequently, with that anticipation, it becomes possible to determine how 
much to pre-fetch and how much to compute. Note that this process could 
easily be extended up the memory hierarchy which includes the network. 

Now we are in each cache piece. Here we do not want to realize each FFT 
transpose. That is, we now want to determine the indices we need for log2 n 
cycles of the FFT since we now we have locality. Our description of the 
butterfly^ is a hyper-cube reformulation using MoA algebra and subsequently 
the ■i/'-calculus. We show how multiple transposes on this hyper-cube can be 
expressed as an invariant number of loops (three for this example), starts, 
stops, and strides. We see through the formulation of the arithmetic needed 
for the FFT that we index pairs of components in the hyper-cube then assign 
the same components their updated computed FFT portion. Pairs are assigned 
at once. Here we derive the transpose, i.e. the final step. 

5.7.1 Transpose Formulation of Butterfly 

Let I = log2 n, and Vj; < j < ?, let q denote the input vector below, with 
n = 2'. Consequently, Z, is the dimension of the hyper-cube we'll define: 

Z,-=t,Q((<Z> p 2)pq). (5.54) 

That is, during each iteration, j, of the FFT on the vector q, a new transpose 
vector tj is created and conscqiiently a new transpose is performed, i.e. Zj = 
tj Q) Z . In this case, we DO NOT want to materialize the array. We envision 
this algorithm to be applied to a data vector q that fits in the cache. 

Consider the vector of integers c = t(/), where I = log2n is the di- 
mensionality of the hyper-cube. Thus the original data vector q becomes 
{< I > p 2) p q. Now we want to consider tj Q ((< I > p 2) p q ). To do 
this we consider indexing with i where i is a full index of the hyper-cube, 

i.e. Vi; <* i <* (< Z > p 2). (5.55) 

Thus, we wish to 'i/'-reduce the following expression: 

iV(tj G) ((< ^ > P 2)P q)), (5.56) 
where the transpose vector tj is defined as: 
tj ^ {{{I - 1) - j) A )c ) *((-!) A c ) a (((/ - 1) - i) V c )). (5.57) 
By the definition of transpose we have: 

i^(tj ((< / > p 2) p q )) = i [t ]^((< Z > p 2) p q ). (5.58) 



Now define Z = {<l> p 2) p q. Thus, 
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= Ac)] Ac], 

A ((G-l)-j) V c)])V^Z. (5.59) 

Now we take the index apart and apply the PCT (see Sec. 14.4.^1) . Thus, we 
have 

(i Vj *k)^Z = (kV^(i Vj)V'Z) = kV'(jV'(i '^Z)) = kiPY, 

for any any i ', j, k. (in this case i ' is not the same as the above i ). Now we 
define: 

k^i[.? A (((?-!) -j) V c)], (5.60) 
F = (i [(((/ - 1) - j) A c )] [(-1) A c ])i;Z, (5.61) 

and we will now apply the PCT (see Sec. I4.4.i|) : 

rav (kVF) = ( ravy)[7(k; (rk) A (pF)) X7r((Tk) y {pY))+L{7r{{Tk) y (pY)))]. 

(5.62) 

Now we determine the shape of Y,{pY) as: 

pY = pii [(((/ - 1) - j) A c )] [(-1) A c ]) V (PZ). (5.63) 

Thus, 

(pZ) ^ <l> p 2, (5.64) 
{pY) = {l-j) V {<l> P2), 

= <j > p 2. (5.65) 

Next we need: 

Tk = j, 

(rk) V (pY) =J V {<J> ? 2) =<> . (5.66) 

Thus, 7r((Tk) V (pY)) = 7r(<>) = 1, and (rk) A (pY) ^ j A {< j > p 2) 
=< j > 'p 2. Using these results we apply the PCT to get: 

rav(i[j A {{{l~l)-3) V c )]^r, 

= (ravr)[7(i[j A (((/-l)-j) V c)];<j> ? 2) x ! + .(!)], 
= (ravr)[7(ib- A (((/-l)-j) V c)];< j > ? 2)]. (5.67) 

Now we simply another step: j A {{{I — 1) — j) y c ) = ((/ — 1) — j) + 
Thus, 

rav(ib- A (((/-l)-j) V c)]^r 

= (ravr)[7(i[((?-l)-j) + t(j)];(<j > ? 2))]. (5.68) 

Now we reduce Y further: 
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= i[(-l) A c]^(i [(((/- A c)]VZ, 

= A c]^W. (5.69) 

Where we have the definitions: 

W = imi Ac )]ijZ, (5.70) 

and, 

k' =i[(-l) Ac]. (5.71) 
Thus the PCT in this case is: 

(rav(k VVF)) = (ravVF)[7(k '; (rk ') A (pW)) x 7r((Tk') y (pW)) 

+ t(^((Tk') V (PW^)))]- (5.72) 

Thus we need: 

(rk ') = T(i [(-1) A c ]) - T(i - 1]) - 1, (5.73) 

and, 

pW = pmil - 1) - j) A c)] V P^), 
= a-l-j) V (<^> P 2), 
= <l-il-l-j)> p2, 

= <j + l> p2. (5.74) 

Thus: 

(rk ') A pVF = 1 A (< j + 1 > p 2)=<2 >, (5.75) 
(rk ') V = 1 V (< J + 1 > ? 2) =< j > p 2. (5.76) 
Using these expressions we can now evaluate ( ravY) as: 

(ravF) = rav(k VVK), 

= (ravVF)[7(i[(-l) A c ]; < 2 >) x 7r(< j > p 2), 

+ L{7r{< J > p 2))]. (5.77) 

Now simphfy i [(—1) A c ] = i [< ^ — 1 >]. Now the (ravF) expression is 
subscripted as: 

(ravl^)[7(i[<; - 1>]; < 2 >) x 7r(< j > p 2) + l{tt{< j > p 2))] 

[7(i[((/-l)-j) + t(j)];(<j> p 2)]. (5.78) 

In other words, the second expression in brackets, in Eq. I5.78[ is a subscript 
(extracts an element) from the preceding expression (i.e. the first part of 
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Eq. l5.78l that has the form ( rav W) [ ] which is a sub-array since the expression 
in square brackets is a vector). Next reduce W = k il>Z. With k = i [(((Z — 
1) - j) A c )], Z > p 2, the PCX gives: 

rav(k "^jjZ) = (ravZ)[7(k "; (rk ") A pZ) x 7r((Tk ") y pZ) 

+ .(^((Tk") V P^))]- (5.79) 

Using: 

rk" = ((/-l)-j), 
(rk ") A = < (Z - 1) - j) A (< Z > p 2), 
^<{l-l)-3> p 2. 

and, 

(rk ") V pZ =< {l-il- I-])) > VP^ =< ,7 + 1 > P 2, (5.82) 

we obtain: 

(ravVF) = rav(k "ipZ), 

= (ravZ)[7(i[((/-l)-j) A c ];<((/- 1) - j) > p 2) 

X 7r(< j + 1 > p 2) + i(< j + 1 > p 2)]. (5.83) 

This is in a transitional ONF. Why would this be true? Notice, when we go 
from ip indexing to [ ] indexing, we utihze 7, index vectors, and shapes. That 
is, we start to use the Psi Correspondence Theorem . Thus, we have created an 
ONF which does indeed calculate the offset from the start of the array accessed 
contiguously in memory. However, these indices i.e. offsets, are akin to random 
accesses. Ideally, we want to be able to calculate starts, stops, and strides. This 
representation can be generally fed to hardware, e.g. DMAs, FPGAs, and 
ASICS, etc. We do this now. The following Generic Normal Form illustrates 
the idealized form discussed above. Here we have minimized the design to 
three deterministic loops. That is, for any size problem represented in any 
dimension, we have only three loops. We also know how to calculate stop, i.e. 
the upper bound, and all strides as follows: 

Vj; 0<j<l, 

Vto; < to< 7r(<(; - 1) - j> p 2), 
Vn; <n < 7r(< j > p 2), 
Vfc; < fc < 2, 

Zj = (ravZ)[(m x (tt < j + 1 > p 2)) + {k x tt(< j > p 2)) +n]. (5.84) 
qed0 

* Notice that the k loop cycles through the elements of t (2). Also note that q (on 
which Z is based) has the weights applied prior to this step. 



(5.80) 
(5.81) 
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Thus for each value of j an array Zj is created according to Eq. 15.841 
by looping through the variables from fastest (innermost loop) to slowest 
(outermost loop) in the order fc, n, and m. Now we verify that this is the 
correct ordering of the loops. 

For example, assume I — 8, j — A, and define 

CEE t(/) =<0 1 2 3 4 5 6 7> . (5.85) 

The ordering of the loops over variables k, n, and m is related to the full 
index: 

i =<zo ii i2 is «4 is «6 «7>, (5.86) 

used to select an element of the new array. In this index, the variables ip for 

< p < 8 each take on the values < ip < 2. 

In Eqs. 15.681 and [5?78l we find the following pieces of i selected: 

i[</- 1>] = i[<7>] = ^7, (5.87) 

i[((/-l)-j) + i(j)] =i[<3 4 5 6>] =<i3«4i5«6>, (5.88) 

and, 

i[((? - 1) - j) A c] = i[< 1 2 >] =< io iii2> ■ (5.89) 

We see that as i cycles through all possible values, the piece given by Eq. l5.87| 
cycles fastest followed by that in Eq. 15.881 and the slowest cycling piece in 
Eq. l5.89l Close examination of Eqs. [5.68l and [5.78l shows that these three pieces 
(Eqs. I5.87[ I5.88[ and I5.89P appear in 7 expressions involving the following 
shapes: < 2 >, (< j > p < 2 >) and {< {I — 1) — j > 'p 2) respectively. As 
such, the corresponding 7 expressions take on the bounds of the variables fc, 
n, m (from fastest to slowest). This allows us to eliminate the 7 expressions 
in Eg. [Engl and [5751 to yield the final ONE of Eq. lEMl 

(5.90) 

5.8 Merging the two Derivations 

At each iteration, j, we deal with the quantity Zj defined for < j < Z in 
Eq. 15.541 Note also we deal with the vector index j (not to be confused with 
the scalar index j). The vector index takes on the values: <* j <* (< 

1 — 1 > p 2), and when used to index Zj we obtain: 

jipz, = < ((j ^0)^Z,) ((j *1)^Z,) > . (5.91) 
But, Zj = tj G) Z, and Vi; <* i <* (< / > p 2), we have: 



iV(t,Gz) = i[t,]vz. 



(5.92) 
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The DNF previously given for the FFT in Eq. 15.521 thus becomes: 

((j -H-0)[t, ]'iIjZ)+ < 1 - 1 > x((j ^l)[t, ])^Z, 
and when reduced becomes: 

Vj; 0<j<Z, 

Vm; < m < tt{< {I - I) - j > p 2), 
Vn; < 71 < 7r(<j > p 2), 

(ravZ)[(m x 7r(<j + 1> p 2)) + ((t(2)) x 7r(<j> p 2)) +n] 
= (ravZ)[(m X 7r(<j + 1> p 2)) + (0 x 7r(< j > p 2)) + n] 
+ < 1 - 1 > x(rav2')[m x 7r(<j + 1 > p 2) 

+ (1 X 7r(<j> p 2)) + n]. (5.93) 

Notice, that computationally we'd really be calculating the strides in the 
registers, not the control structures as in the classical design of the FFT. 
Finally, let / = 7r(< j > p 2), and let g = n{<j + 1 > p 2) then, 

( rav Z) [(m x g) + n + (< 1 > x /)] 

EE ( rav Z)[{m X g) +n] + < I - 1 > x ( rav Z) [{m x g) + (/ + n){5.94) 

And we finally make contact with Eq. 15.531 by writing the the right hand side 
of Eq.EJlas: 

<{a + d) {a-d)>, (5.95) 

if we identify a and d with: 

a = {rav Z)[{m X g) + n], (5.96) 

and, 

d = {rav Z)[{m X g) + if + n)]. (5.97) 

Thus the entire array of Eq. 15.941 at step j (scalar j) is an array of shape 
(< / > p 2) consisting of the collection of two component vectors given in 
Eq. I5.95i each of which being indexed by the I — 1 component index vector j 
of shape (<^ — 1 > p 2). 

This is the final result which is relatively simple considering the lengthy 
derivation required to produce it. 

5.9 Implementation and Performance 

This section recaps the implementation and resulting performance details that 
were presented in the first two chapters of this series. We emphasize that 
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do q = 1 ,t 
L = 2**q 

do row = 0,L/2-1 

weight(row) = exp((2*pi*i*row)/L) 
end do 

do col = 0,N-1,L 
do row = 0,L/2-1 

c = weight(row)*x(col+row+L/2) 
d = x(col+row) 
x(col+row) = d + c 
x(col+row+L/2) = d - c 
end do 
end do 
end do 



Fig. 5.2. Fragment for the most important piece of the CC code (radix 2, in-place, 
butterfly). In this fragment t — log2{N) is the power of 2 corresponding to the total 
number of array elements, A'', and x is the array being transformed. 



the present chapter is a formal derivation of a practical algorithm that was 
implemented and tested in Chaps. [3] and [H 

The innermost part of the FFT algorithm is shown as implemented in 
Fortran 90 in Fig. 15.21 This algorithm is essentially the non-cache optimized 
FFT that was taken from Ref. [16]. 

The ONF of Eq. 15.941 is equivalent to the cache optimized FFT that was 
implemented and tested in the first two chapters in this series. In practice, 
however, Eq. 15. 94] implies a simpler control structure that increments by unity 
rather than a power of two as illustrated in the code fragment of Fig. 15.21 We 
expect the hyper-cube formulation of Eq. 15.941 to be somewhat faster due to 
this simpler control structure. This conjecture is currently being tested. 

Two other important code fragments implement the reshape-transpose 
operation and the final transpose that carry out data rearrangements for the 
cache optimized FFT. The reshape-transpose operation is given below as im- 
plemented in C-f-t- in Fig. 15.31 and the final transpose is illustrated in Fig. 15.41 

In Figs. 15.21 15. 3[ and 15.41 we intentionally present implementations in For- 
tran 90 and C-f + to emphasize that our derivations serve as a prescription 
for implementation in any language. 

The performance results for our cache-optimized FFT algorithm (pre- 
sented in the two previous chapters) are presented in Figs. 15.5 1 and 15.61 These 
experiments were run in a single-processor, dedicated, non-shared environ- 
ment on the IBM SP2 machine "squall" at the Maui High-Performance Su- 
percomputer center [58j . Specifications for the machine are quoted in the 
caption to Fig. 15.51 
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void traiis_rshp(complex *datvec , complex *temp,int nmax, 
int csize) 

{ 

int iind, j ind, kind, kmax, j max , imax, index , c2size ; 

int rows,arg; 

int max C int a, int b) ; 

// The routine carries out the transpose-reshape operation 

index=0 ; 

rows = nmax/csize; 
imax = rows-1; 

c2size = int (powCcsize ,2 . 0) ) ; 
jmax = max(0,nmax/c2size-l) ; 



f or ( j ind=0 ; j ind<=jmax ; j ind++) 
-C 

f or Ciind=0 ; iind<=imax; iind++) 
{ 

tempEindex] = datvec [j ind+csize*iind] ; 
index += 1 ; 

} 

} 

f or (iind=0 ; iind<=nmax-l ; iind++) 
{ 

datvecEiind] = tempEiind]; 

} 

> 

Fig. 5.3. C-I--I- code fragment implementing the reshape-transpose in terms of 
index manipulations. 

In the first figure fFig. I5.5|l we plot the time vs. input data length. There 
are two curves, one for our new cache-optimized FFT and one for a similar 
run with no cache optimization. Direct comparisons are possible since both 
curves are produced by the same code. For the non-cache-optimized run, we 
simply chose the blocking size c (specified as a parameter at run time) to be 
greater than or equal to the length of the data vector. We see that the curves 
have essentially the same shape but the cache-optimized one is shifted to the 
right by one power of two compared to the non-optimized one. Thus for a 
given run time, we can legitimately claim a factor of two speed-up. 

The results presented in Fig. 15.61 emphasize the improved performance for 
a fixed data size by taking the ratio of the run time for the non-optimized run 
to that for the cache-optimized one. We see that for some of the largest sizes 
considered, a factor on the order of 4 speedup is achieved. 

Figure ISTBl is also enlightening in that it highlights the various levels of the 
memory hierarchy. A change in slope of run-time vs. size indicates the crossing 
of a boundary between one level of the memory hierarchy and another. For 
example from the results of Fig. 15. 61 we can make the following estimates. For 
roughly 2^ < < 2^ the speed is most likely dominated by the speed of the 
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void final_trans (complex *datvec , complex *temp,int nmax, 
int lcap,int csize) 

{ 

/* dvar = # of 2's in the hyper-cube 

* leap = # of transpose-reshapes (T-rho) that have to be undone 

*/ 

lege = logCfloat (csize) ) /log(2 . 0) ; 

sig = lcap*logc; 

smax = pow(f loat (2) , sig) ; 

dvar = log(f loat (nmax) ) /log(2 . 0) ; 
tmax = pow(2.0,dvar-sig) ; 

index = 0; 

f or (tind=0 ; tind<=tmax-l ; tind++) 
{ 

f or (sind=0 ; sind<=smax-l ; sind++) 
{ 

temp [index] = datvec [sind*tmax + tind] ; 
index += 1 ; 

} 

} 

f or (index=0 ; index<=nmax-l ; index++) 
-C 

datvec [index] = temp [index]; 

} 

} 

Fig. 5.4. C++ code fragment implementing the final transpose bringing the data 
back into the correct order. 

registers. For 2^ < < 2^ the speed is dominated by LI and L2 cache and 
for 2^ < A'' < 2^^ the speed corresponds to main memory. For N > 2^^ paged 
memory (of size 4fci?) dominates. The large jumps in performance (i.e. factors 
of 4 for some of the largest sizes) correspond to the presence or absence of 
page faults. 

In general, the performance of the algorithm is a tradeoff between the 
increased speed obtained by having more data in the cache (with increasing 
c) and the cost of actually moving the data around. One might naively guess 
that the best performance would be obtained by choosing c to be equal to the 
cache size. However, we find, the best performance by choosing blocking sizes c 
given by small powers of 2. In other words, it is more economical to move data 
around many times within the cache than it is to move large blocks into and 
out of cache due to the extreme speed of the cache. In a sense, the operating 
system is able to overlap computation and 10 using small blocking sizes c. 
Direct comparisons of our non- cache optimized routine to library routines 
showing comparable, and in most cases, superior performance were presented 
in Refs. [T71 [H [H HD] and Chap. [TJ 
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Fig. 5.5. Comparison of the cache-optimized FFT with our previous i/)-designed 
FFT [19] indicating reproducible enhanced performance. The data in this figure 
represent the raw timing data obtained by running the experiments in a single- 
processor, dedicated, non-shared environment on the Maui SP2 machine "squall" 
(one of 2, 375Mhz Nighthawk-2, IBM SP2 nodes). Reproducibility was demonstrated 
by comparison of the results of five separate runs which produced nearly identical 
results (not shown) . The slope of the curve reveals the speed of various levels of the 
memory hierarchy as amplified in Fig. 15.61 

5.10 Conclusion 

We have presented a derivation of the FFT using the techniques of Conformal 
Computing in the framework of a hyper-cube data structure. The final result 
is given in Eq. 15.941 and is extremely simple given the lengthy derivation that 
led to it. The structure is very simple (one can clearly see the three loops over 
the variables k, n and m for each step j in the FFT) and is independent of 
the length of the input data vector. The addresses associated with the loop 
variables k, n and m will be evaluated in registers. As such, the present imple- 
mentation is expected to be faster than the result of the previous two chapters. 
This is because the outermost loop variable j is successively incremented by 
unity in contrast to the traditional approach in which it is incremented by a 
power of 2. Effort is currently underway to test this conjecture. 

This chapter also serves as a continuing in depth tutorial introduction to 
the methods of Conformal Computing applied to a non-trivial example. Ev- 
ery step of every calculation has been presented in full detail and is based on 
the introduction of the Conformal Computing machinery in earlier chapters. 
These techniques are extremely powerful in that they allow one to eliminate 
temporary arrays through the use of direct indexing (note the role played by 




Fig. 5.6. Performance enhancement of the cache-optimized FFT as compared with 
our previous f/j-designed FFT showing a factor of four enhancement for some of the 
largest sizes. The data plotted in this figure is the ratio of the time for the non- 
optimized routine divided by that for the cache-optimized routine. The changing 
slope of the optimization curve reveals various levels of the memory hierarchy. For 
roughly 2^ < N <2^ the speed is most likely dominated by the speed of the registers. 
For 2^ < Af < 2** the speed is dominated by LI and L2 cache and for 2^ < iV < 2^^ 
the speed corresponds to main memory. For A'" > 2^'^ paged memory (of size AkB) 
dominates. The jumps in performance correspond to the presence or absence of page 
faults. 



the loop variables fe, n, m, in Eg. I5.94j) . Another important aspect of this ap- 
proach is that once the analysis is carried out, the resulting ONF (see Ea. l5.94p 
is a prescription for building an efficient computer program in any 
convenient language for implementation in software or hardware. 
Also, since the same formalism is used to describe the machine as 
is used to describe the science and/or engineering problem, one 
is empowered to reason mathematically about the correctness and 
efficiency of the implementation. 
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Density Matrix Operations for a Quantum 
Computer 



6.1 Chapter Summciry 

This chapter is concerned with the efficient manipulation of sparse matrix 

operations that arise in the simulation of a quantum computer. In particular, 
matrix multiplication traditionally employed to effect the gating operation of 
a particular qubit or collection of qubits is replaced by an equivalent operation 
involving direct indexing of the matrix elements. As such, the efficiency of a 
quantum simulator will be greatly enhanced due to the elimination of the need 
for temporary arrays. The algorithm is completely general and applies to the 
gating operation of arbitrary collections of qubits. The algorithm we present 
allows one to do a number of generalized matrix operations in a single step 
thus eliminating the need for large temporary arrays. 

6.2 Quantum Computing: Motivation for a Matrix 
Problem with Arbitreiry Array Access Patterns 

We now give a brief overview of the motivation for the present problem. The 
dream of Quantum Computing is the realization of a quantum computer in 
which data is represented by the states of a physical system such as the spin of 
an electron or proton. The physical picture one should imagine (to the extent 
that quantum processes can be imagined) is that of a spin or collection of spins 
interacting with electromagnetic fields. One possible embodiment of a quan- 
tum computer would be to utilize an apparatus closely resembling that used 
in Magnetic Resonance Imagining (MRI). Individual spins are manipulated 
through application of pulses of electromagnetic fields. 

The primary interest in Quantum Computing is the promise of greatly 
increased computing capability duo to the inherently parallel nature of data 
storage and computation resulting from the superposition principle of quan- 
tum mechanics. For example, it has been theoretically proven that certain 
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algorithms requiring exponential time on a classical computer can be solved 
in polynomial time on a quantum computer 82J. We shall say no more about 
Quantum Computing in this chapter and we turn to now the specific ma- 
trix problem to be solved. Further background material can be found in 
Refs. [831 [M [Ml [Ml [87] and references therein. 

The algorithm we present is based on the techniques of Conformal Com- 
puting: a rigorous mathematical approach to the construction of computer 
programs based on an algebra of abstract data types {A Mathematics of Ar- 
rays) and an indexing calculus (the ip- Calculus). One of the most important 
aspects of this approach is the ability to compose a sequence of algebraic ma- 
nipulations in terms of array shapes and direct indexing. The net result is 
the elimination of temporary arrays, which leads to significant performance 
improvements. Another important point of this approach is that the math- 
ematics used to describe the problem is the same as that used to describe 
the details of the hardware. Thus at the end of a derivation the resulting 
final expression can simply be translated into portable, efficient code for im- 
plementation in hardware and/or software. Another important attribute of 
the Conformal Computing approach is the ability to mathematically prove 
that the resulting implementation is maximally efficient given a set of met- 
rics (e.g. speeds of memory levels, processors, networks, etc). The details of 
this approach have been presented in detail elsewhere in Refs. [SOl [16] and in 
Chap. M 

The reader should not be misled by the name Conformal Computing. Con- 
formal in this sense is not related to Conformal Mapping or similar constructs 
from mathematics although it was inspired by these concepts in which certain 
properties are preserved under transformations. In particular, by Conformal 
Computing we mean a mathematical system that conforms as closely as pos- 
sible to the underlying structure of the hardware. 

6.3 Quantum Evolution: the Density Matrix 

Linear Algebra is the natural context in which to describe operations in a 
quantum computer and the central quantity is the density matrix. The density 
matrix provides a complete description of the time evolution of the quantum 
system. Changes to the system under the application of gating operations 
are represented as unitary transformations of the density matrix. In practice 
these transformations are carried out by multiplying the density matrix on 
the right and left by a unitary matrix and its inverse respectively. In general, 
for any given operation, we require only a sparse collection of elements from 
the density matrix to be rearranged. The specific arrangement of the required 
elements in the matrix depends on which spin or collection of spins are being 
manipulated. 

We propose herein a sparse matrix algorithm that eliminates the need to 
store large sparse unitary matrices corresponding to specific gating operations. 
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In effect we use direct indexing to effectively move the required density matrix 
elements onto the diagonal to achieve block-diagonal form. Then the gating 
operations are applied to these elements in a simplified form. Note, we are 
not actually moving elements of the density matrix around but rather we are 
carrying out such operations virtually through direct indexing. The net result 
is an algorithm requiring fewer fioating point multiplies and less storage. 

We focus on the following problem. Given the density matrix, for an arbi- 
trary quantum operation on an arbitrary number of states (qubits) we wish 
(for computational convenience) to rearrange the data so as to place the re- 
quired elements on the diagonal in block-diagonal form. Using the techniques 
of Conformal Computing we have found a way to do this in one step. 

In the following we consider 16 x 16 matrices corresponding to a situation 
in which we have four qubits and we are considering only operations involving 
two qubits at a time. We choose this situation merely for the purpose of 
illustration. The algorithm that we present in this chapter is applicable to 
arbitrary numbers of qubits (i.e. 2" for some non-negative integer n) and 
arbitrary collections of qubits to be gated at any one time. 

Consider the following possible arrangements for a 16 x 16 density matrix 
for which we wish to manipulate two spins (qubits). 



010101010101010101010101010101010101010101010101 
001100110011001100110011001100110011001100110011 
000011110000111100001111000011110000111100001111 
xxab 000000001001010010011011100100101101110110111111 

0000 abed 

0001 e f g h 

0010 i j k 1 

0011 m n o p 



0100 abed 

0101 e f g h 

0110 i j k 1 

0111 m n o p 

1000 abed 

1001 e f g h 

1010 i j k 1 

1011 m n o p 

1100 abed 

1101 e f g h 

1110 i j k 1 

1111 m n o p 



010101010101010101010101010101010101010101010101 
001100110011001100110011001100110011001100110011 
000011110000111100001111000011110000111100001111 
xaxb 000000001001010010011011100100101101110110111111 



0000 a b c d 

0001 e f g h 

0010 a b c d 

0011 e f g h 

0100 i j k 1 

0101 m n op 

0110 i j k 1 

0111 m n op 

1000 a b c d 

1001 e f g h 

1010 a b c d 

1011 e f g h 

1100 i j k 1 

1101 m n op 

1110 i j k 1 

1111 m n op 



Fig. 6.1. Accessing qubits. We wish to rearrange a pattern such as that on the right 
into one such as that on the left virtually using direct indexing. 



We wish to rearrange a pattern such as the one on the right into a block- 
diagonal form such as that on the left. The transformation is effected by 
viewing the 2^ x 2" density matrix as a 2^" dimensional hyper-cube and 
carrying out a certain rearrangement of the indices of the hyper-cube. Note 
that the labeling of the entries in the matrix corresponds to the indices of 
the hyper-cube. These indices ~ the hyper-cube coordinates - are simply the 
integers in the base-2 representation. 

For a 2"' X 2^ density matrix _D, we say that the shape of the array (a 
vector giving the lengths of its dimensions) is pD —< 2^^ 2^^ > (see Chap. [5] 
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for elements of the theory). Now we reshape the array into a hyper-cube Df, . 
Now the shape of the hyper-cube is a vector of 2" 2's, that is, < 2 2 • • ■ 2 >. 

The rearrangement we seek is a certain permutation of the indices of the 
hyper-cube. We write the block-diagonal matrix Db as Db = (pD) p {t(^Dfi), 
where the vector t is a permutation vector and the operator corresponds 
to transposing the indices of the hyper-cube as specified by the permutation 
vector. For the specific example above, we have t =<0 2134657 >. In the 
following, we present an algorithm for determining the general permutation 
vector. 



6.4 The Algorithm 

We want to view the qubits as coordinates in a hyper- cube, that is formed by 
reshaping (restructuring) the density matrix. 

Based on which qubits we wish to gate, i.e. move to the diagonal, we 
desire to create a permutation vector that permutes the indices of the hyper- 
cube. This is accomplished by applying MoA's binary transpose (see Chap. [2]). 
Consequently, all gated bits are moved to the diagonal in a block fashion. The 
blocks are square, i.e. 2'? x2^. with q denoting the number of qubits. Note that 
the design scales to multiple density hyper-cubes, and processing of 
each gate could be performed in parallel. We'll first look at an explicit 
example, i.e. a 16 x 16 density matrix where we gate various combinations of 
2 qubits. Thus we start with a 2^* x 2'* density matrix which gets restructured 
to a 2® hyper-cube. That is, we structure the density matrix, denoted by D 
to be a hyper-cube, Dh, defined as follows: 

Dh^{<8> p2)p D, (6.1) 

where the reshape operator p changes the shape of the array. Consequently, a 
2-dimensional structure has been transformed into an 8-dimensional structure. 
We want to now develop an algorithm that transposes Dh and by doing so 
eliminates the need for numerous permutation matrices, i.e. matrices that 
must be multiplied with the density matrix, D, to have the effect of moving 
the desired qubits to the diagonal. 

Let's begin with 4, 4 x 4 diagonal block matrices. That means that bits 
and 2 are gated (counting from right to left), i.e. the pattern in Fig. 16.21 
Consequently, we will use ah to denote which bits to gate in the density matrix. 
Note that here we talk about ah or 2 bits. In general, we'll say ra bits. This 
means that whenever we want to reference a patterned sparse array of gate 
indices, we must permute them such that they are on the diagonal, blocked as 
4x4 sub-matrices. Figures, 16.21 and 16.31 demonstrate some of the ways two 
bits may be gated. Notice also what we have done is transform row, column 
indices of D into their base 2 representation. We will now show how to combine 
the later to form coordinates. 



6.4 The Algorithm 
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Fig. 6.2. The xaxb permutation. 
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Fig. 6.3. The xabx permutation. 



When an algorithm is defined nsing MoA and reduced using it's '0-calculus, 
a normal form is revealed. When processor/memory/ ... / hierarchies are 
added, i.e. increasing the dimension of the algorithm^ the iteration space and 
data flow over each level is also normalized. Consequently, we can describe 
the physics using a data structure indicative of its quantum nature. The 
same algebra can partition the problem into blocks indicative of the pro- 
cessor/memory/.../ hierarchy used for execution. 

6.4.1 Assumptions in the Example: 

Gated Bits: Let a and b denote which bits to gate. In our examples we look 
at the following bit patterns: 
• xaxb: bits and 2, fPig. 
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• xabx: bits 1 and 2, (Fig. 
There are others, but for the purpose of illustration we will only consider 
the above patterns. 

Bit Ordering: Bits are numbered from right to left. That is, 1 1 1 is used 
to evaluate its decimal equivalent: 

(1 * 2^) + (1 * 2^) + (1 * 2^) + (0 * 2"). 

Vector Ordering: Indexing is numbered from left to right That is, a vector, 
< 1110 > when indexed would yield: 

< 1110 > [0] = 1, 
when the 0th index is accessed and, 

< 1110 > [3] = 0, 

when accessing the third. 
Example (cont.): From qubits to permutation vector: 

bits 0,2: xaxb ^3210 3210 bit ordering 

0123 0123 index ordering 

0213 0213 swap bits 1 and 2 

<02134657> is the transpose vector 



bits 1,2: xabx ^3210 3210 bit ordering 

0123 0123 index ordering 

0132 0132 swap bits 2 and 3 

0312 0312 swap bits 1 and 2 

<03124756> is the transpose vector 



Let t denote a transpose vector. Thus, given an arbitrary index vector i 
s.t. <* i <* 22" of the DhB 

composed as above, we permute all indices by 

t, i.e. the permuted index: 

i[t]=tQi (6.2) 

acting on the original array is equivalent to the index i acting on the permuted 
array which, consequently, never needs to be materialized. The permutation 
vector t effectively moves all gated indices to the diagonal. 

Indices are calculated and addressed directly from the original array 
stored in memory thus, eliminating intermediate arrays or permutation ma- 
trices. We have shown that we can algebraically represent the physics, alge- 
braically describe an all-at-once operation that is algebraically decomposable 



^ The notation (7/*v and v/*a denotes a binary operation /s.t. / is a boolean 
operation, e.g. < or <, and a is applied point-wise to each component of v, e.g. 
a < v[i] V < i < (tti), denotes comparison of the elements of the vector i. 
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to present and future architectural platforms (even quantum). The algebra 
remains the same throughout the problem, the decomposition over proces- 
sor/memory/FPGA, the mapping, and the architectural abstraction, with 
verifiable designs. 

6.4.2 Final Expression and Normal Form 

Given a density matrix D such that the shape is given by: pD =< 2n 2n >, 

we restructure to a hyper-cube, Dh- Let s denote the shape of Dh s.t. 

s =< (<2n> p 2) > (i.e. the shape is a vector consisting of 2n, 2's). Then 



Using t, the transpose vector previously defined, perform a binary transpose: 



Now, all matrices defined by bits chosen are on the diagonal. Note: Restruc- 
turing back to D and indexing creates no new arrays because of ^/^-reduction. 
Now V^-reduce to normal form Generic Design. 



This is the Generic Normal Form. Equation 16.41 denotes the address of an 
element of the density matrix D in terms of the address of the first element 
@D plus the offset 7(i[t];s). The quantity 7(i[t];s) is the polynomial that 
generates an address from the index vector and the shape. 

6.5 Conclusion 

We have presented a general algorithm for increasing the efficiency of certain 
key operations that arise in the solution of the evolution equations for a quan- 
tum system in the density matrix formalism. In particular, standard matrix 
multiplication operations that arise in the description of the quantum gating 
operation have been eliminated. The equivalent operation is now represented 
in terms of a direct indexing of the appropriate matrix elements. The present 
approach will increase the efficiency of simulations of quantum computers 
through the elimination of temporary arrays and parallel processing. 

In effect we use direct indexing to effectively move the required density 
matrix elements onto the diagonal to achieve block-diagonal form. Then the 
gating operations are applied to these elements in a simplified form. Note, we 
are not actually moving elements of the density matrix around but rather 
we are carrying out such operations virtually through direct indexing. The 
net result is an algorithm requiring fewer floating point multiplies and less 
storage. 



Dh^spD 



Vi s.t. < *i < * <(<2n> p2)>, 
iij{t D) = @D + j{i [t];s) 



(6.3) 
(6.4) 
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7.1 Conclusions 

We have presented a self-contained introduction to the methods of Conformal 
Computing and have iUustrated their use. The introduction presented a survey 
of the history of these techniques that have been developed over the past three 
decades. We also presented a review of our published work in which we found 
considerable performance gains of important algorithms such as the FFT in 
comparison with well-tuned library routines. Considerable reference material 
was also provided illustrating a variety of applications to algorithms of interest 
to science and engineering. 

In Chap. [2] a survey of the Mathematics of Arrays (MoA) and V'-calculus 
was given. These techniques are the two cornerstones of the Conformal Com- 
puting approach. The MoA is similar to the algebra of the APL programming 
language. Indeed, MoA was inspired by APL and was an outgrowth of research 
built on Sylvester's Universal Algebra. The MoA represents a substantial im- 
provement over APL's algebra, however, in that a number of mathematical 
anomalies have been corrected in the MoA's theoretical foundations. We now 
wish to emphasize an important point: Conformal Computing is NOT 
APL and only has algebraic similarities in common with APL. In 
particular there is no concept of an indexing calculus in APL AND 
MoA is NOT a programming language. It is a Mathematical Theory. 
Conformal Computing's indexing calculus (i.e. the V'-calculus) is a crucial as- 
pect of our approach that facilitates the construction of efficient computer pro- 
grams through the derivation of the Operational Normal Form (ONF) from the 
Denotational Normal Form (DNF). This process, called -^-reduction is unique 
to the Conformal Computing approach and has been successfully exploited 
in a number of important contexts. Neither APL nor any other programming 
language contain the ■(/'-calculus. Other languages, including APL, e.g. Fortran 
95, ZPL, etc. contain indexing rules NOT theories and none can be reduced 
to a normal form that allows one to prove the equivalence of programs. 
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The present monograph represents the most complete and self-contained 
account of the Conformal Computing approach to date. The bulk of this mono- 
graph is contained in Chaps. [3]through[S]in which previously unpubhshed work 
is presented. This work illustrates the techniques in extensive detail and serves 
as an in-depth look at the workings of the Conformal Computing approach. 
Chapter [3] presents a new cache- optimized FFT primarily in the language 
of traditional linear algebra. As such it bridges the gap between traditional 
mathematics of science and engineering and the MoA and ■(/'-calculus. In ad- 
dition, Chap. [3] demonstrates the importance of the new algorithm in which 
speedups on the order of factors of 2 to 4 are achieved. We emphasize that 
these are improvements over our previously pubhshed work that was shown 
to be competitive or superior to well-tuned library routines. jl6( I19j 

The full machinery of the Conformal Computing approach, as applied to 
our cache- optimized routine, is presented in Chaps. [HandO Chapter|3]presents 
details of the ^'-reduction process for two key steps in the algorithm: (1) the 
reshape-transpose operation and the (2) final transpose operation. 

We are finding that it is often convenient to work with multi-dimensional 
arrays that have been restructured {reshaped) as hyper-cubes. In this approach, 
an algorithm is developed and expressed as certain operations on the index 
vector of the hyper-cube. The hyper-cube approach to the cache- optimized 
FFT is presented in detail in Chap. [5] 

The power of the hyper-cube approach is illustrated in another example in 
Chap.[6]in the context of a quantum simulator. In this example, an algorithm 
on the hyper-cube index vector was developed that allows one to efficiently 
select arbitrary sparse collections of matrix elements and virtually move 
them to achieve a block-diagonal form for further processing. 

The ability to eliminate unnecessary temporary arrays and to compose 
algorithm steps leads to efficient code. The ONF is a mathematical prescrip- 
tion for how the code is to be built in any language of choice for software 
and hardware applications. Our approach uses a common formalism to rep- 
resent both the computational problem as well as the underlying hardware. 
In this way it becomes possible to mathematically reason about optimal per- 
formance of a given algorithm in a given computational environment given 
a set of performance metrics (e.g. speed of the various levels of the mem- 
ory/processor/network hierarchy). The cache- optimized FFT is an example 
of this in which details of the hardware (e.g. the cache size) are specified 
parametrically at run time. Natural extensions of this approach to include 
processors, networks, etc. are possible and are the subject of future research. 
As architectures get smaller, eventually quantum, issues such as heat, power, 
etc. will become essential parameters that Conformal Computing can address. 
The authors have attempted to present Conformal Computing in as much de- 
tail as possible and in a tutorial style so as to enable others to apply these 
powerful techniques to their own research. 
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7.2 Grand Challenges and Future Directions 

The Conformal Computing approach is poised to provide solutions to a num- 
ber of important problems facing the field of large-scale and embedded com- 
puting. However, Grand Challenges in Computational Science are a necessary 
prerequisite. Such Challenges include: 

• The Ability to Prove Two Executables are Equivalent. 

• The Ability to Provide Intentional Information Regarding Loops, Vari- 
ables, etc. to a Translator (or Compiler) Such that Various Instatiations 
are Built, e.g. OpenMP, MPI, forks, threads, etc. 

• Tools to Theorize About Performance Prior to Building Code. 

• Identifying Which Array (Matrix, Tensor) Based Algorithms are Common 
Across Domains: LU, QR, etc. 

• Develop Theories Similar to Conformal Computing. 

• Determine the Theoretical Foundations Behind Expression Templates Such 
that an Interface or Tool can be Built to Interface Normal Forms to How 
to Build Instantiations. 

• Determine how Languages Such as Matlab and Fortran 90 Can Provide 
MoA and Psi Calculus to Users. 
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